Foundations of probability theory (terrytao.wordpress.com)
94 points by websec on Oct 2, 2015 | 49 comments



I don't know what mathematicians think of it, but I enjoyed "Probability Theory: The Logic of Science".

https://www.google.com/#q=probability+the+logic+of+science


Its take on probability as a way of modelling how our brains reason is refreshing. For instance, from the beginning of chapter 1:

Suppose some dark night a policeman walks down a street, apparently deserted; but suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out through the broken window, carrying a bag which turns out to be full of expensive jewelry. The policeman doesn't hesitate at all in deciding that this gentleman is dishonest. But by what reasoning process does he arrive at this conclusion? Let us first take a leisurely look at the general nature of such problems.

A moment's thought makes it clear that our policeman's conclusion was not a logical deduction from the evidence; for there may have been a perfectly innocent explanation for everything. It might be, for example, that this gentleman was the owner of the jewelry store and he was coming home from a masquerade party, and didn't have the key with him. But just as he walked by his store a passing truck threw a stone through the window; and he was only protecting his own property. Now while the policeman's reasoning process was not logical deduction, we will grant that it had a certain degree of validity. The evidence did not make the gentleman's dishonesty certain, but it did make it extremely plausible. This is an example of a kind of reasoning in which we have all become more or less proficient, necessarily, long before studying mathematical theories. We are hardly able to get through one waking hour without facing some situation (e.g. will it rain or won't it?) where we do not have enough information to permit deductive reasoning; but still we must decide immediately what to do.


Ah, but the policeman's actual thought process might have been purely deductive or purely reflexive; it might only be that the probabilities justify a series of discrete calculations.


This is called abductive reasoning, FYI.

I sometimes joke Sherlock Holmes is really all about "The Science of Abduction" not deduction; a rather more threatening tagline.


It's interesting that, starting from the principles he states in section 1.7, the rules of probability can be derived, as sketched below. This formulation is from http://arxiv.org/abs/1507.06597 (where it is shown that the measure-theoretic approach is just a special case):

(Representation) Degrees of plausibility pl(A|B) can be identified as unique real numbers p ∈ R; and

(Qualitative Correspondence With Common Sense; Logical Internal Consistency) This has three sub-Principles:

– If there is more than one path to a correct conclusion, all such paths must lead to the same plausibility result;

– In assessing pl(A|B), You must always use all of the available information that You regard as relevant to the assessment; and

– Equivalent states of information about (A|B) always lead to the same pl(A|B).
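
For reference, the rules of probability such derivations arrive at are the familiar product and sum rules, roughly:

  P(A \wedge B \mid C) \;=\; P(A \mid C)\, P(B \mid A \wedge C),
  \qquad
  P(A \mid C) + P(\lnot A \mid C) \;=\; 1.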


I enjoy it; his papers and that book sit proudly on my desk. It's interesting that his collection of papers takes an entirely different approach to probability theory and hardly focuses on it being an extension of Aristotelian logic.


The fact that Shannon entropy is still relevant across so much modern mathematics research is amazing.

Apart from minor typos that make him a "sloppy mathematician", he is a good educator, based on my personal experience taking grad courses from him. He's not the passionate high-school STEM teacher type, but he offers great insights. I think which textbook he uses is somewhat secondary. For courses he has taught before, he usually picks a standard text and teaches based on the material he wrote in his blog posts, including exercises.


I have a question that might be related to the topic. I have found that often, when solving problems related to probability theory, it is more convenient to think in terms of combinations than to think in terms of probabilities (perhaps because I'm a programmer). Would it be possible to devise a programming language that allows me to program a filter that selects the desired outcomes out of the complete set of combinations, and to have the compiler automatically deduce a closed-form formula for the probability (without enumeration)? If this is not possible in general, what would the restrictions on this programming language be, to make it work in practice?

So, for example, given the question of what the probability is that, when throwing 4 coins, exactly 2 will be heads, I could write a function that generates all possible outcomes "TTTT", "TTTH", etc., and I could write a filter function that returns true for "TTHH", "THTH", "THHT", etc.
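
For instance, a minimal Python sketch of that brute-force idea (the helper names are just illustrative, and this enumerates outcomes rather than deriving a closed form):

  from itertools import product
  from fractions import Fraction

  def outcomes(n):
      # All 2^n equally likely coin sequences, e.g. ('T', 'T', 'H', 'H').
      return list(product("TH", repeat=n))

  def probability(n, keep):
      # Fraction of the outcome space selected by the filter `keep`.
      space = outcomes(n)
      return Fraction(sum(1 for o in space if keep(o)), len(space))

  # P(exactly 2 heads in 4 throws) = 6/16 = 3/8
  print(probability(4, lambda o: o.count("H") == 2))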


This is possible, and in fact probably implemented in some probabilistic programming languages, but I think you are looking in the wrong direction.

The point is that even for fairly simple real use cases, the computational complexity is so huge that all the computers in the world couldn't finish in your lifetime if you stick to naive algorithms and don't employ some approximation or optimization.

So, that is what the whole field of machine learning is about: finding some clever ways to deal with random variables in a computationally feasible way...


All those probabilistic programming languages will become exponentially faster once we have feasible quantum computers, since BPP ⊆ BQP. We currently use a weaker inclusion, BPP ⊆ PSPACE, as the core execution model.


Check your facts, because this is simply not true.


Most introductory books on probability I have seen open with the combinatorial approach you describe. When it is applicable, this approach is very powerful. However, there is essentially nothing novel in combinatorial probability relative to combinatorics, so research and advanced topics in this area are done under the title of combinatorics instead of probability. Probability becomes a field in its own right once we move beyond the cases where a combinatorial approach will work.


You might find this interesting: http://blog.plover.com/prog/haskell/probmonad.html

It doesn't produce a closed-form formula, but, well, those tend to scale poorly to programming-sized tasks anyhow.


I think you're looking for Pascal's Triangle.

  let P = probability of 2 heads given 4 coins
  let omega = (x + y)^n     # binomial expansion over the outcome space
  let x = 50%               # P(heads)
  let y = 50%               # P(tails)
  let (n, k) = (4, 2)       # 4 coins, 2 heads

  P = C(n, k) * x^k * y^(n-k)
    = C(4, 2) * (50%)^2 * (50%)^2
    = 6 * 25% * 25%         # C(4, 2) = 6; see link
    = 6 * 6.25%
    = 37.5%
https://en.wikipedia.org/wiki/Pascal%27s_triangle
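
And, just as a hedged sketch of the same closed-form calculation in code (Python's standard library has the binomial coefficient):

  from math import comb

  def binomial_pmf(n, k, p):
      # P(exactly k heads in n independent flips with P(heads) = p)
      return comb(n, k) * p**k * (1 - p)**(n - k)

  print(binomial_pmf(4, 2, 0.5))  # 0.375, i.e. 37.5%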


I think that would only work in the discrete case or for simple problems. However, something relevant:

>“Information theory must precede probability theory and not be based on it.” A.N.Kolmogorov, in [Kolmogorov, 19831.

and it was from a combinatorics paper


I looked at the text to be used, Durrett. There's more, and of higher quality, in any of M. Loeve, J. Neveu, K. Chung, and L. Breiman.

For what Tao wrote on that page about determinism or whatever -- just f'get about that.

And what he wrote, confusing a sample space and a probability space, just say that he had a bad day that day.


To each their own. Durrett's text is perfectly fine. The books you mention are older, and perhaps you originally learned from them, so you still hold a torch?

In particular, sending people to Loeve for their first course in measure theoretic probability would be really cruel.

Personally, I learned from Feller, and Billingsley, and finally Durrett.


> To each their own. Durrett's text is perfectly fine. The books you mention are older, and perhaps you originally learned from them, so you still hold a torch?

On most of the topics, Durrett seems to have less than any of the four texts I mentioned.

> In particular, sending people to Loeve for their first course in measure theoretic probability would be really cruel.

I listed the authors in the reverse of the order in which to read them! The easiest start is the last, Breiman. But some things in Loeve are good, e.g., sufficient statistics. Neveu is my favorite as the most elegant, but his exercises are the most difficult. Breiman and Chung are fine. IIRC, both Breiman and Neveu were Loeve students.

> Personally, I learned from Feller, and Billingsley, and finally Durrett.

I used Feller I and/or II for reference. Otherwise his writing seemed to lack an overall unification.

The only Billingsley text I used, and then only for a small topic, was Convergence of Probability Measures.


Out of curiosity, what's the background that lets you be so dismissive of a Fields Medal winner?

At this point I see a blog post written by a well respected mathematician whom I feel comfortable trusting and it's being brushed aside by I don't know who.


Well, you can just read the reviews to see that the Durrett text isn't well regarded while others like Chung's are. And the criticism about a probability space not being a sample space is correct, but I think it's clear what Tao meant there, namely that the sample space would be part of a probability space.


Thanks

The Amazon UK reviews of Chung's book led me to A Probability Path by Sidney Resnick, which appears to be aimed at non-mathematicians. I have invested (speaking as a renegade physicist lacking a systematic exploration of measure theory).


Being a great mathematician doesn't necessarily make one a great math educator. It is odd that Tao chose Durrett, but I assume it's due to the book being freely available online.


As the blog comments suggest, there's a difference between a good self-study book and a good textbook for a class. The book will be accompanied by (at least) what looks to be a good set of course notes.

Some googling around suggests his students are quite pleased with him. His blog is, according to Wikipedia's intro for him, the undisputed king of math blogs. Both of these point to him being at least a good or above-average educator.


Background: a Ph.D. from a famous, world-class research university, from research on stochastic optimal control. Good to see that Durrett's book touches on regular conditional probabilities -- I needed that topic!

I learned from a star student of E. Cinlar, long at Princeton. We used Royden, Rudin, Neveu, and Chung, and there were some nice topics in the course not in any of those texts, e.g., the Lindeberg-Feller version of the central limit theorem, a really nice, astounding, result on an envelope for Brownian motion, more on ergodic theory, some on additive processes, and more. Super nice course.


Are you advocating "proof by authority"?


Of course. Appeal to authority was never a fallacy; the fallacy is "appeal to false authority". "Terry Tao knows much more about math in general and this in particular than I do, so I'll trust what he says here" is completely valid.


A fields medal doesn't indicate general authority in mathematics, only authority with regards to what was necessary for getting the medal.


OK, but being a math professor does indicate general authority in mathematics.


Appeal to authority is a dumb logical fallacy; in reality, geniuses are better at everything, and we should weight their opinions some decent amount over everyone else's. Obviously.


> And what he wrote, confusing a sample space and a probability space, just say that he had a bad day that day.

Could you expand on this? I'm reading his post now, but from the outside, it seems unlikely that Terry got the concepts completely wrong, and more likely that he's using slightly different definitions for the concepts than you are expecting.


No, he's just being sloppy.

That can be a real pain for students trying to learn. But be warned: a lot of people work with probability, but only a tiny fraction ever had a course in graduate probability. So, eventually you have to learn to put up with, and sometimes rewrite, some of what is written that is not very precise.

Here are the accepted definitions:

Take a non-empty set, usually denoted by Omega, and call it the sample space or the set of trials.

Take a collection of subsets of Omega, usually denoted by script F, and call it the set of events. Have at least enough subsets of Omega in the set of events script F so that script F will be a sigma algebra. So, script F has to have as an element the empty set, be closed under relative complements, and be closed under countable unions.

We want all that for probability theory. We want closure under countably infinite unions so that we can discuss, say, the event that a coin never or always comes up heads. We don't want closure under uncountably infinite unions because it would create a big mess in the theory.

Then the ordered pair (Omega, script F) is called a measurable space -- it doesn't have a measure yet but soon will.

On that measurable space, define a positive measure P so that P( Omega ) = 1. Then P is a probability measure on the measurable space.

A probability space is the triple ( Omega, script F, P ).
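
Collected in one place, in the usual notation, those requirements (together with countable additivity, which is part of P being a measure) read:

  \emptyset \in \mathcal{F}; \qquad
  A \in \mathcal{F} \implies \Omega \setminus A \in \mathcal{F}; \qquad
  A_1, A_2, \ldots \in \mathcal{F} \implies \bigcup_{i \ge 1} A_i \in \mathcal{F};

  P \colon \mathcal{F} \to [0, 1], \qquad P(\Omega) = 1, \qquad
  P\Big( \bigcup_{i \ge 1} A_i \Big) = \sum_{i \ge 1} P(A_i)
  \quad \text{for pairwise disjoint } A_i \in \mathcal{F}.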

Intuitively, a measure assigns to each event A in script F a non-negative real number P(A). A signed measure permits negative values. And of course at times we want a measure that yields complex numbers.

Measure theory is in, say, P. Halmos, Measure Theory. The back of Halmos has a really nice introduction to probability theory, the Kolmogorov three series theorem in stochastic processes, etc. For more in stochastic processes there is, of course, the classic J. Doob, Stochastic Processes. Halmos was a Doob student and then served as an assistant to von Neumann at the Institute for Advanced Study. From that he wrote Halmos, Finite Dimensional Vector Spaces, a finite dimensional introduction to Hilbert space. Later at University of Chicago Halmos worked in mathematical statistics and made the fundamental contribution to sufficient statistics. IIRC, it was at Chicago that Halmos wrote Measure Theory.

Great details on measure theory and much more are in, say, H. Royden, Real Analysis and W. Rudin, Real and Complex Analysis.

To narrow measure theory to probability theory, use the famous texts by the authors I listed, Loeve, .... Royden and Rudin are terrific prerequisites to Loeve, etc.

Commonly the sigma algebra of events script F is the smallest sigma algebra that contains as a subset a topology, that is, a collection of open sets, on the sample space Omega.

That's all standard, beginning stuff in graduate probability -- ah, the only kind one should bother with anyway!


> We want all that for probability theory. We want closure under countably infinite unions so that we can discuss, say, the event that a coin never or always comes up heads. We don't want closure under uncountably infinite unions because it would create a big mess in the theory.

Who should I read for an introduction to that "big mess"?

So many of the cases I'm actually interested in require real-valued (continuous) inputs and outcomes. While one can quantize these to create an approximation of something countable, it seems like a much simpler theory would be possible if it were built from the ground up to handle these common non-discrete real-world cases, rather than trying to shoehorn them into standard probability theory. I was hoping this might be the direction that Terry was headed, with the emphasis on probabilistic methods.


I think you're misunderstanding. Standard basic (measure-theoretic) probability theory is designed to handle common non-discrete real-world cases: continuous random variables like height, temperature, etc. They're not approximated by something countable; instead theorems proving that they have the sort of behavior you'd want are established by proving them for a countable approximation, then taking limits. It is exactly like integration: prove things for step functions, then make the steps infinitely thin. The theory is clean and straightforward.
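
For instance, one standard construction, sketched roughly: a non-negative random variable X is approximated from below by discrete random variables, and the expectation is defined as the limit,

  X_n = \min\!\left( \frac{\lfloor 2^n X \rfloor}{2^n},\; n \right) \uparrow X,
  \qquad
  \mathbb{E}[X] \;:=\; \lim_{n \to \infty} \mathbb{E}[X_n].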

Here's where it gets less basic: say you want to look at temperature over time, but you don't want to model temperature as a variable that's measured daily, or hourly, or even every second (secondly?), but you want to model it as a process that evolves in continuous time. That's where the theory gets messy. Not necessarily at the level of a user of this theory, but definitely at the level of proving that the math you want to use is allowed.

If you actually need an introduction to that sort of probability, Lawrence Evans (Berkeley) has some old lecture notes aimed at undergrads [1] that he turned into a book [2]. If (more likely) you want standard measure-theoretic probability theory (as opposed to what's taught to undergraduates), David Pollard's book is pretty good [3].

I'm sure that @graycat will scoff at those recommendations, but his reading list would be considered excessively hardcore and time consuming even for a graduate student in math, which I'm assuming you're not.

[1]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.416...

[2]: http://www.amazon.com/An-Introduction-Stochastic-Differentia...

[3]: http://www.amazon.com/Theoretic-Probability-Statistical-Prob...

ps: after looking at it again, the intro in Evans's notes is as gentle as it's going to get, so start there. And (as you'll find out, unless you're some sort of savant) this shit's hard. If you actually want to understand this stuff, graduate coursework is probably the only practical way to do it.


You're right, and I misunderstood. I'm a computer programmer trying to rapidly learn enough about probability theory to be able to communicate with some theoretical statisticians regarding causality, confounding, and longitudinal data analysis. I have a decent intuitive grasp of what's happening, but no ability to convey anything with proper terminology. I could certainly use a better grasp of the basics, and I'm trying to figure out where to start. Thanks for the links.


Okay, for that stuff probability theory is too abstract. For basic basics, Edward Tufte has a $2 ebook that's pretty good: Data Analysis For Politics And Policy[1] and for terminology in causality, Rubin has a short open access paper[2]. For a freshman-stats level treatment, OpenStax college's book looks legitimate but I haven't actually read it carefully[3].

[1]: https://www.edwardtufte.com/tufte/ebooks

[2]: https://projecteuclid.org/euclid.aoas/1223908042

[3]: http://cnx.org/contents/30189442-6998-4686-ac05-ed152b91b9de


Thanks for your references. I have always thought that many statistics/probability-based explanations are ad hoc. They are ad hoc because they explain pre-selected facts, and their predictions are just confirming instances (cf. positive vs. confirming instances from Larry Laudan, a philosopher of science). Your point "definitely at the level of proving that the math you want to use is allowed" hints in that direction.


Probability's hard to teach. You can give informal statements and kind of wave your hands at the underlying theory, or you can give a rigorous well-founded treatment that's intellectually satisfying. But the rigorous foundation uses math that's a step or two beyond what undergraduate math majors learn. It's not necessarily harder than what math majors see, but it's a ton of extra material to teach, when the payoff is that you can now (after half a year) prove that the conditional probability is well-defined as

Pr(A | B) = Pr(A and B) / Pr(B)

instead of just telling it to students and drawing a few diagrams that drive the point home.

But I think it's more pragmatic than ad hoc. Any deep theory of probability that doesn't deliver

Pr(A | B) = Pr(A and B) / Pr(B)

is basically useless since that's how random phenomena seem to behave in real life. Having a deeper theory is useful because it allows you to derive other implications of that theory and makes certain calculations much easier. But if the theory disagrees with phenomena that we want to model, that can be a problem.
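
A tiny brute-force sanity check of that formula (a Python sketch; the two events here are chosen arbitrarily for illustration):

  from itertools import product
  from fractions import Fraction

  space = list(product("HT", repeat=3))        # three fair coin flips
  A = {s for s in space if s[0] == "H"}        # first flip is heads
  B = {s for s in space if s.count("H") >= 2}  # at least two heads

  def pr(event):
      return Fraction(len(event), len(space))

  # Pr(A | B) counted directly vs. Pr(A and B) / Pr(B): both give 3/4.
  print(Fraction(len(A & B), len(B)), pr(A & B) / pr(B))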


> I'm sure that @graycat will scoff at those recommendations, but his reading list would be considered excessively hardcore and time consuming even for a graduate student in math, which I'm assuming you're not.

Probability and stochastic processes based on measure theory are not very popular in the US, even in graduate math departments.

Uh, scoff, scoff. Okay?

The full measure theoretic details of stochastic processes in continuous time can be a bit of a challenge. That topic can be important, e.g., for Brownian motion and stochastic differential equations used in mathematical finance. Of course, there is

Karatzas and Shreve, Brownian Motion and Stochastic Calculus and Chung and Williams, Introduction to Stochastic Integration. And there's much more, especially from Russia and France.

But otherwise, usually in practice, what people are interested in is either (1) second-order stationary stochastic processes, e.g., as in electronic or acoustical signals and noise, where they are commonly interested in power spectral estimation, digital filtering, maybe Wiener filtering, the fast Fourier transform, etc., or (2) what is in, say, Cinlar, Introduction to Stochastic Processes.

In Cinlar, for the continuous-time case, you get a good introduction to the Poisson process (the vanilla arrival process, e.g., like clicks at a Geiger counter, new sessions at a Web site, and much more). You also get what else people are mostly interested in in practice: Markov processes in discrete time with a discrete state space (that is, the values are discrete).
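
As a toy illustration of that discrete-time, discrete-state case, here is a hedged Python sketch (the two-state transition probabilities are made up for illustration):

  import random

  # Hypothetical two-state chain; each row of the transition matrix sums to 1.
  P = {"sunny": {"sunny": 0.8, "rainy": 0.2},
       "rainy": {"sunny": 0.5, "rainy": 0.5}}

  def step(state):
      # Sample the next state from the current state's transition row.
      r, cum = random.random(), 0.0
      for nxt, p in P[state].items():
          cum += p
          if r < cum:
              return nxt
      return nxt

  state, counts = "sunny", {"sunny": 0, "rainy": 0}
  for _ in range(100_000):
      state = step(state)
      counts[state] += 1
  print(counts)  # long-run fractions approach the stationary distribution (5/7, 2/7)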

The case of Markov processes in continuous time and discrete state space is not so tough if the jumps are driven by just a Poisson process. But there is still more in Cinlar.

And there are other good texts on stochastic processes.

For (1), look at some of the texts used by EEs. The measure-theory approach is in Doob, Stochastic Processes, Loeve, Probability Theory, and several more texts by quite good authors. E.g., without measure theory, you can just dive in via Blackman and Tukey, The Measurement of Power Spectra ....

With all these sources, you are able to get by without measure theory. Yes, without measure theory, in some places you will have to not ask to understand too much and just skip over some details to get back to the applied stuff.

But for measure theory, the Durrett text seems to get a student there unusually quickly.

For more, at the MIT Web site there is an online course in mathematical finance that avoids measure theory. They want to use the Radon-Nikodym theorem and Ito integration but still avoid measure theory. Uh, the Radon-Nikodym theorem is a generalization of the fundamental theorem of calculus. Once you see it, it's dirt simple, but a good proof takes a bit -- or follow von Neumann's proof that knocks it all off in one stroke (it's in Rudin, Real and Complex Analysis).


All of this is fine for real-valued inputs and outcomes. It's the number of events (coin flips, measurements, etc) that we're restricting to be countable.


No, the number of events is necessarily either finite or uncountable. Indeed, it is a nice exercise that there are no countably infinite sigma algebras (extra credit for a solution!).

It's just that you can't take uncountably many events, take their union, and assume that the result is also an event.


No problem. For the foundations I outlined, you can work just fine with continuous functions, measurable functions, stochastic processes, random variables taking values on the real line, in the complex plane, in finite-dimensional real or complex vector spaces with, say, the usual topology, Hilbert and Banach spaces, etc. You can do multi-dimensional Markov processes, and much more.

And you can have each point on the real line be an event. Fine. But you just can't take the uncountable union of an arbitrary collection of such events and assume that the result is also an event.

As for the event a random variable takes a value >= 0? Fine.

Or, let the Borel subsets of the real line be the smallest sigma algebra that contains all the open sets, e.g., all the open intervals. Then for a Borel set A and a real-valued random variable X, you can ask for the probability that X is in A.
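
In symbols, that statement is just:

  P(X \in A) \;=\; P\big( \{ \omega \in \Omega : X(\omega) \in A \} \big),
  \qquad A \ \text{a Borel set}.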

I believe you will find that you will have a solid foundation for what you want.

To see all this stuff, you need more than just the sparse definitions; instead, you need an actual text and maybe a course. I recently looked at the online materials from MIT and didn't see such a course. Graduate probability is not all that popular in the US; stochastic processes in continuous time are still less popular.

To study graduate probability, I'd recommend a good undergraduate major in pure math with good coverage of, say, W. Rudin, Principles of Mathematical Analysis. Then good coverage of linear algebra from more than one of the best known texts. Likely also spend as much time as you can in Halmos, Finite Dimensional Vector Spaces. E.g., at one time, Halmos, Rudin, and Spivak, Calculus on Manifolds were the three main texts for Harvard's famous Math 55.

Get good with proving the theorems.

I also recommend Fleming, Functions of Several Variables.

Then, sure, Royden, Real Analysis. Couldn't be prettier.

If not in a hurry, then the real half of Rudin's Real and Complex Analysis. Especially if you like Fourier theory!

Then of the probability books, I believe that the nicest, first book is L. Breiman, Probability. He wrote that before he went consulting and came back and did CART and random forests.

Next, K. Chung, A Course in Probability Theory. Next, J. Neveu, Mathematical Foundations of the Calculus of Probability. Then, Loeve, Probability Theory.

Loeve is huge -- mostly just use it for reference or browse. E.g., it has sufficient statistics and stationary stochastic processes (the EEs love that), which IIRC are not in the other books.

IIRC, both Breiman and Neveu were Loeve students at Berkeley.

If you do well with Breiman, then for graduate probability you likely can stop there. Else, Chung will then be fast and easy reading and reinforce what you learned in Breiman. Neveu is elegant, my favorite, but you deserve extra credit for each workable exercise you can find -- not actually work, you understand, just find! Sure, some of the exercises are terrific, half a course in a few lines of an exercise. E.g., he has one of those on statistical decision theory or some such. And see the Tulcea material in the back.

Then there's more that you can do on stochastic processes, potential theory via Brownian motion, e.g., for mathematical finance, stochastic optimal control, and more.


>the only kind one should bother with anyway!

I hope all this studying will pay off, then. I have Fremlin vol. 1/vol. 2 sitting underneath my text while I handle the prerequisite material; hopefully the journey will be worth it.

Measure theory can be used with a lot of stuff I guess.


In short, measure theory is ordinary freshman calculus done in a way that is, in some contexts, significantly more powerful.

Here, measure is essentially just a synonym for area -- ordinary area.

Measure theory addresses both the differentiation and integration of freshman calculus, but most of the focus is on integration.

So, in freshman calculus, you are given, say, a real valued function of a real variable, say, function f where, for real numbers x, we have, e.g.,

f(x) = x^2 + 1

Then, say, we graph f and want the area under the curve for x in the interval [0,10]. Okay, you saw that more than 10,000 times in freshman calculus.

Well, in this case, measure theory will give the same answer for the area under the curve. The difference is how that area is calculated.

Here is the shortest description of where measure theory is different. As you recall, in freshman calculus you found the area under the curve by picking points on the X axis, that is, partitioning the X axis into little intervals; on each interval you built a rectangle that approximated the area under the curve over that little interval; you added points to the partition so that the length of the longest of the little intervals got as small as you pleased; you added up the areas of the rectangles; and you took the limit. That was it. That was the definition of what the integral was. Of course, to integrate x^2 + 1 you learned about, for any constant C, the anti-derivative

(1/3)x^3 + x + C
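
So, for the running example, the area under the curve for x in [0,10] works out to

  \int_0^{10} (x^2 + 1)\, dx
  \;=\; \Big[ \tfrac{1}{3} x^3 + x \Big]_0^{10}
  \;=\; \tfrac{1000}{3} + 10
  \;=\; \tfrac{1030}{3} \;\approx\; 343.33.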

So, here's what measure theory does: yup, it also works with a partition, but in this case the partition is on the Y axis instead of the X axis. Then for each little interval on the Y axis, we get a horizontal bar and look at the parts of it that are under the curve. Then, as we add points to the partition, we add up the ordinary areas of the relevant parts of the horizontal bars. The picture is less nice than in freshman calculus but, still, no biggie.
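
In symbols, the idea just described amounts, roughly, to approximating

  \int f \, dm \;\approx\; \sum_i y_i \, m\big( \{ x : y_i \le f(x) < y_{i+1} \} \big),

with the approximation becoming exact as the partition y_0 < y_1 < ... of the Y axis is refined (the textbook development does this via simple functions).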

Now, how would one do, say, numerical integration? Sure: the same as in freshman calculus, say, the trapezoid rule, where you work with little trapezoids, or Simpson's rule. Nope, measure theory is not suggesting that we change that.

Here are four features of the approach of measure theory:

(1) There are some goofy, pathological functions that get the integration theory (the Riemann integral) of freshman calculus all confused. E.g., consider the function that is 0 on the rational numbers and 1 otherwise. Then the upper and lower Riemann sums of those little rectangles never converge to a common value. Bummer. Well, it follows from some theorems that the integral of measure theory rarely gets confused -- of course the theorems are very precise, but the generality is mind-blowing, so much so that it's darned tricky to construct a case where the measure-theory integral fails.
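
Concretely, for that example on [0, 1]:

  f(x) =
  \begin{cases}
    0 & x \in \mathbb{Q} \\
    1 & \text{otherwise}
  \end{cases}
  \qquad\Longrightarrow\qquad
  \int_{[0,1]} f \, dm = 1,

since the rationals have Lebesgue measure zero, while the Riemann upper and lower sums stay stuck at 1 and 0.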

(2) A big deal is what happens when a sequence of functions is used to approximate and converge to another function. A leading example is Fourier series. Well, on taking limits during this approximation, the Riemann integral can get confused when the measure theory integral (Lebesgue) does just fine.

H. Lebesgue was a student of E. Borel in France and did his work near 1900.

(3) Often we want to pass some limits under the integral sign. Again, Lebesgue does much better here than Riemann. Indeed, the Lebesgue integral has a super nice theorem on differentiation under the integral sign (from the TOC of Durrett, that theorem may be the last topic in that book -- it was a really fun exercise when I was studying that stuff).

(4) Notice a biggie: With Lebesgue, actually we used next to nothing about the X axis, that is, about the domain of the function we are integrating. In this way, right away, presto, bingo, we get a theory of integration that works on domains with much, much less in assumptions and properties than the real numbers or the usual finite dimensional real Euclidean vector space. In particular, we are now GO for doing probability theory -- the Lebesgue integral is used to define both the probability of events and the expectation of random variables. It was A. Kolmogorov in 1933 who noticed, and wrote a monster paper on, how to use measure theory and the Lebesgue integral as a solid foundation for probability theory. Since then, essentially all serious research in probability and stochastic processes, much of mathematical statistics, and nearly all of stochastic optimal control has been based solidly on the Kolmogorov, i.e., measure theory, foundations.

So, to a mathematician not much interested in probability, probability theory is just a special case of measure theory where the total area (measure) is just 1. That's not literally or logically wrong, but it does discard a fantastic baby with the bathwater. Some of the results in probability are just astounding and powerful -- both beyond belief.

So, in measure theory, here is what a measure (a definition of area) is: We start with a non empty set, a space, say, X. From X we have some subsets we call measurable sets. So, measurable set A is a subset of X. In the special case of probability, the measurable sets are the events, that is, e.g., all the trials where our coin comes up heads.

We ask to have enough measurable sets so that all of them form a sigma algebra. Why? Otherwise we don't have much. A sigma algebra doesn't ask for much. The sigma part is supposed to suggest finite or countably infinite adding up, as in the usual use of the capital Greek sigma for summing.

Say our sigma algebra of subsets of our measurable space X is S (probability theory usually uses script F). Then we want the empty subset of X to be an element of S. For A in S we want X - A (the relative complement) to be an element of S. And for B(i) in S for i = 1, 2, ..., we want the union of all the B(i) to be an element of S. These conditions ensure that we will have enough sets in S to have a decently strong theory.
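
The smallest non-trivial example: for a single coin flip, take X = {H, T} and let S be the full power set,

  S = \{\, \emptyset,\ \{H\},\ \{T\},\ \{H, T\} \,\},

which you can check satisfies all three conditions.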

In what we commonly do in applied probability, we wouldn't settle for less. E.g., the empty set is the event it never happens. If H is the event of heads, then T = X - H, the relative complement, is the event tails. If H(i) is the event that flip i comes up heads, then the union of all the H(i) is the event the coin comes up heads at least once or doesn't always come up tails. In probability, those are all simple situations, and just from those we need a sigma algebra of events. And it turns out, that's enough, and has been since 1933.

So, for a measure, say, m, to measurable set A there is real number m(A), the measure (think area or, in probability theory, the probability) of A. Of course in probability we call the measure P instead of m and write P(A) for the probability of event A.

You can begin to see that we are essentially forced into how Kolmogorov applied measure theory whether we like it or not. Sorry 'bout that!

Well, for a measure m, we want countable additivity. So, for disjoint measurable sets B(i), i = 1, 2, ..., we want

m(union B(i)) = sum m(B(i))

for some sloppy math notation since I can't type TeX here!

Usually m(A) is real with m(A) >= 0, and commonly we need to allow m to take the value infinity, e.g., m(X) = infinity for Lebesgue measure on the whole real line.

We can also extend to measures where m(A) can be any real number (signed measures) or any complex number (complex measures).

Measure theory is the total cat's meow for Fourier theory!

To get a sigma algebra of measurable sets we want, commonly we start with a topology, that is, its collection of open sets, and ask for the unique smallest sigma algebra for which each open set is also a measurable set in the sigma algebra.

When we do this on the real line and assign the measure of intervals their ordinary length and extend that to as many subsets of the reals as we can, we get Lebesgue measure for the real line. We get a lot of sets! It's a tricky exercise, that uses the axiom of choice, even to construct a subset of the reals that is not Lebesgue measurable. Powerful theory!

Suppose we have spaces X and Y, each with a sigma algebra of subsets and a function

f: X --> Y

Then f is measurable if, for each measurable subset B of Y, f^(-1)(B) is also a measurable subset of X. In measure theory, when we integrate a function, we ask that it be measurable. In the usual cases, it's even tough to construct a function that is not measurable. Darned near any limit of measurable functions is also measurable -- super nice theory.

In probability theory, a random variable is just a measurable function where its domain is a probability space, that is, a sample space Omega with a sigma algebra of subsets script F and a probability measure P.

Of course, there's much more, stacks, shelves, racks of books as difficult as you wish, but the above is a simple, intuitive view from 10,000 feet up.

Or, measure theory is a nicer theory of area and area under a curve that in all the simple cases gives you just what you have been used to!


I'm saving your responses to view later as I really appreciate them. Nassim also highly recommends Leo Breiman's text; I'll actually be working through the path you recommend, or something very similar.


> In the standard foundations of probability theory, as laid out by Kolmogorov, we can then model these events and random variables by introducing a sample space (which will be a probability space) to capture all the ambient sources of randomness; events are then modeled as measurable subsets of this sample space, and random variables are modeled as measurable functions on this sample space.

This matches the definitions of Wikipedia pretty closely: "A probability space consists of three parts: A sample space [...], A set of events [...], The assignment of probabilities to the events".

So either you misunderstood that sentence or both Wikipedia and Terence Tao are wrong.


The sample space is not the probability space. The probability space has a sample space though, which is what the Wikipedia definition says, but not what Tao literally said. I think it is pretty clear what Tao meant though.


" ... a sample space (which will be a probability space)"

is, in a word, wrong. The sample space will be part of a probability space but will not "be a probability space".


Nonsense. Tao is using perfectly idiomatic language here - "The sample space will be a probability space (once we have endowed it with some additional structure)".


That's not good: we don't want to have to use words such as "endowed" and "structure" that are not precisely defined in the context and, thus, are conceptually fuzzy.

Some intuitive overviews, clearly labeled as such, are fine and can be helpful, but "idiomatic language" just is not. You won't find such in the writings of W. Rudin, P. Halmos, J. Neveu, or any of a long list of authors of some of the best math books. In a good math or computer science journal, a reviewer or the editor would likely reject "idiomatic language".

The definitions I gave are the accepted ones.



