The unreasonable effectiveness of conditional probabilities (two-wrongs.com)
214 points by kqr on Feb 22, 2023 | hide | past | favorite | 78 comments



I find this a quite terrible explanation of a fairly simple concept, but perhaps I haven't had enough coffee. Here's an alternative try:

First, for the computer scientists out there, this is the Coupon Collector's problem - so you probably know already it's Θ(N log N). That doesn't give us a precise answer but it should guide our intuition - it should take somewhere like 4 * log(4) ~= 8 draws to get what we want. [1]

Being exact:

When you already have 3 of the 4 toys, what's the chance you get the 4th? It's 1 in 4. The expected number of draws to get an item with P=1/4 is 4.

Generalizing, the expected number of tries to get an item with probability p is 1/p. When you have 2 toys, you have a 1/2 probability of getting one of the two you haven't seen already, etc. When you have no toys, you always get something you don't have.

So the only slightly tricky case is when you have 1 toy, where you have a 3 in 4 chance of getting something new. 1/(3/4) = 1 1/3.

So the answer is very simply 4 + 2 + 1 1/3 + 1 = 8.3333.

More on the Coupon Collector's Problem: https://en.wikipedia.org/wiki/Coupon_collector%27s_problem

[1] The actual bounds use ln, not log2, but from a "getting intuition" perspective log2 gets us into the right range.
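Both the rough N log N estimate and the exact 8 1/3 can be sanity-checked with a quick simulation (a sketch; uniform draws over 4 equally likely toys assumed):

```python
import random

def meals_to_collect(n_toys, rng):
    # Buy meals until every one of the n_toys distinct toys has been seen.
    seen = set()
    meals = 0
    while len(seen) < n_toys:
        seen.add(rng.randrange(n_toys))
        meals += 1
    return meals

rng = random.Random(0)
trials = 100_000
mean = sum(meals_to_collect(4, rng) for _ in range(trials)) / trials
print(mean)  # close to 8.33
```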


To put it in a different way, construct a Markov chain with these states and transitions:

(A_0) -> (A_1) -> (A_2) -> (A_3) -> (A_4)

State A_i = you have i unique toys. Each state from 1-4 also has a transition to itself.

The probability of going from A_i -> A_{i+1} = P(A_{i,i+1}) = (1 - i/4). This is pretty obviously: 1, 3/4, 2/4, and 1/4. The transition probability of staying in the same state is 1 - that = i/4, although we don't use those values in our calculations.

The expected time to reach A_4 is E[A_0 -> A_1] + E[A_1 -> A_2] + E[A_2 -> A_3] + E[A_3 -> A_4] = 1/P(A_{0,1}) + 1/P(A_{1,2}) + 1/P(A_{2,3}) + 1/P(A_{3,4}) = 1 + 4/3 + 2 + 4 = 8 1/3. This is true due both to linearity of expectation and the fact that each state only either stays the same or advances. If any state could transition to another state besides itself or the subsequent one, you'd have to do a more complex calculation.
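The chain's expected absorption time can be computed exactly from these transitions; a small sketch with exact fractions:

```python
from fractions import Fraction

# E_i = expected meals from state A_i to absorption at A_4.
# E_4 = 0 and E_i = 1 + (i/4) E_i + (1 - i/4) E_{i+1},
# which rearranges to E_i = E_{i+1} + 1 / (1 - i/4).
E = Fraction(0)  # E_4
for i in range(3, -1, -1):
    E = E + 1 / Fraction(4 - i, 4)
print(E)  # 25/3
```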


When I started reading the article, I thought it was going to be an introduction to Markov Chains. But now that I’ve finished reading it, I guess we should expect an article named “The Unreasonable Effectiveness of Matrices” before that.


I agree - the OP’s explanation is terrible. What does this even mean:

> From a fundamentals perspective, the expected number of meals is a function of the probability of finding the fourth toy in any specific meal.

I can barely parse it. The probability of finding the nth toy before the nth meal is 0, so is the answer a function of 0? Or maybe the probability of finding the toy with index 4 in a specific meal is 1/4, and the answer is a function of 1/4?

Anyway, if you drink another coffee and stop reducing your fractions, you get a nicer way to say the answer :). After getting n different toys, the probability of a duplicate on the next try is n/4, so the probability of a new toy is 1-n/4 = (4-n)/4, and it takes 4/(4-n) draws in expectation to go from n to n+1 toys. So to go from 0 toys to 4 toys takes 4 * (1/4 + 1/3 + 1/2 + 1/1) tries in expectation, because you can just add the expected number of tries for each new toy.


>> From a fundamentals perspective, the expected number of meals is a function of the probability of finding the fourth toy in any specific meal.

> I can barely parse it. The probability of finding the nth toy before the nth meal is 0, so is the answer a function of 0?

The probability of finding the fourth toy in the first meal is 0%.

The probability of finding the fourth toy in the second meal is 0%.

The probability of finding the fourth toy in the third meal is 0%.

The probability of finding the fourth toy in the fourth meal is 9.4% or whatever.

The probability of finding the fourth toy in the fifth meal is 37.5% or whatever.

Etc.

The expected number of meals is 1 x 0% + 2 x 0% + 3 x 0% + 4 x 9.4% + 5 x 37.5% + 6 x ... + ...


More precisely:

The probability of finding the fourth toy in the first meal is 0%.

The probability of finding the fourth toy in the second meal is 0%.

The probability of finding the fourth toy in the third meal is 0%.

The probability of finding the fourth toy in the fourth meal is 3/32 = 9.4%.

The probability of finding the fourth toy in the fifth meal is 9/64 = 14.1%.

The probability of finding the fourth toy in the sixth meal is 75/512 = 14.6%.

The probability of finding the fourth toy in the seventh meal is 135/1024 = 13.2%.

The probability of finding the fourth toy in the eighth meal is 903/8192 = 11.0%.

The probability of finding the fourth toy in the ninth meal is 1449/16384 = 8.8%.

Etc.

The expected number of meals is 1 x 0% + 2 x 0% + 3 x 0% + 4 x 9.4% + 5 x 14.1% + 6 x 14.6% + 7 x 13.2% + 8 x 11.0% + 9 x 8.8% + ...
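These per-meal probabilities come from inclusion-exclusion over which toys are still missing, and the weighted sum recovers 8 1/3. A sketch in exact arithmetic, truncating the infinite series once the tail is negligible:

```python
from fractions import Fraction
from math import comb

def p_done_within(n, k=4):
    # P(all k toys collected within n meals), by inclusion-exclusion.
    return sum((-1) ** j * comb(k, j) * Fraction(k - j, k) ** n
               for j in range(k + 1))

pmf = {n: p_done_within(n) - p_done_within(n - 1) for n in range(1, 80)}
print(pmf[4], pmf[5])  # 3/32 9/64
mean = sum(n * p for n, p in pmf.items())
print(float(mean))     # close to 8.3333
```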


A way to see the scaling this commenter quoted:

Suppose there are on average N items left to be seen at time t. Then using a continuous time approximation:

dN/dt = -N / N_tot

This gives

-dN / N = dt / N_tot

Integrate the left side from N_tot down to a small value a, and the right side from 0 to t:

ln(N_tot / a) = t / N_tot

Solving for t gives t = N_tot * ln(N_tot / a), which is the N log N scaling quoted.


After 9 purchases, what's the percentage certainty you'd have all four toys? If you had to pre-purchase all n at once from the outset, what's the value of n to be 99.9% certain you'd get all four?


Estimates from 67 million simulations

After 9: 71.13%

p99.9: 29

Ran it up to 2 billion. Still holds, but no additional precision! It's somewhere around 71.134%, so it was threatening to round up for a bit.


Got it up to 17 billion and it flipped to 71.135%! I’m away from my computer and can’t remember more digits.

I’m really surprised by this. I knew things converged slowly, but I didn’t appreciate how much and for how long it would move here. I’d be very appreciative if anyone calculated the number.


I got 71.13647% on a first pass. So we want P(all 4 in 9) = 1 - P(not all 4), and we can split that out a few ways. To not get all four, we can restrict ourselves to three, so that's (3/4)^9, but there are four ways of doing that, so that's 4 * (3/4)^9. But that counts using singles and pairs too many times. Specifically each version of "three" can be exactly three balls, three ways of one ball, or three ways of exactly two balls ("1 or 2 or 3" = "1&2&3" or "just 1" or "just 2" or "just 3" or "1&2" or "1&3" or "2&3").

- We can then subtract 6 * P(two balls), so 6 * (2/4)^9. Now this counts singles a few times too, in fact it cancels all of them out.

- We then need to add back four singles, so 4 * (1/4)^9

Putting this together gives:

1 - (4 * (3/4)^9 - 6 * (2/4)^9 + 4 * (1/4)^9) = 0.7113647
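The same inclusion-exclusion, sketched in code; it also answers the earlier question of how many pre-purchases give 99.9% certainty:

```python
from math import comb

def p_all_four_within(n, k=4):
    # P(all k toys appear within n draws), by inclusion-exclusion.
    return sum((-1) ** j * comb(k, j) * ((k - j) / k) ** n
               for j in range(k + 1))

print(round(p_all_four_within(9), 7))  # 0.7113647

n = 1
while p_all_four_within(n) < 0.999:
    n += 1
print(n)                               # 29 meals for 99.9% certainty
```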


This took me way too long to understand. :)

But I get it now. If the four prizes are ABCD, then if you calculate your chance of only getting two balls in five purchases, you can do it by calculating your chances to get "A or B" five times in a row. But those include the AAAAA and BBBBB scenarios, which aren't two balls.

Repeat that for AC, AD, BC, BD, and CD.



Don't know if you're still reading this thread, but one thing I'm stumped on is that if I plug in 8.333 instead of 9, I get a probability of around 65%.

If I plug in 7, I still get a probability of above 50%, which seems to conflict with the 8.333 answer calculated upstream.

Wouldn't a positive EV imply that a 50% probability is the break-even?


The reply button is back so I can thank you publicly!


Thanks for doing that. I suspected those answers were very different than the 8.3, which I'm guessing must have been around the 50th percentile. The way the initial question was asked, it seemed to suggest you had a pretty high probability of all four after 8.3, which isn't true.


It hinges on the "expect" in "how much do you expect to pay?"

You can think of it like "expected value", where you're doing a comparison. Those meals are $8 each. If someone says they'll give you $80 for a set, this would be a profitable endeavor. On average, you'd expect to spend $66.67 (8.33*8) to get a collection.

If you think of it like "setting expectations", it's not so good. If a mafia boss told you he was expecting you to get all four toys for his daughter, you're not going to place an order for nine meals and walk in there with 71% confidence. I might show up with 59 of them, which is the p99.9999 value.


My initial hunch was also to calculate the answer as 1 + 4/3 + 2 + 4. Ie, "average tries to get 1st toy, plus average tries to get 2nd toy, plus ...."

However, my memory of probability is very fuzzy and I wasn't sure if my above intuition was mathematically rigorous. Ie, can the expected values of these 4 events be simply summed in order to get the combined expected value?

For anyone else curious about this, here's the mathematical proof that you can indeed do so. Ie, a proof that E(X+Y) = E(X) + E(Y), where X is defined as the number of tries to get the 1st toy, Y is the number of tries to get the 2nd toy, and so on: https://www.milefoot.com/math/stat/rv-sums.htm

In this specific example, X and Y are independent random variables. But if my reading of the above proof is correct, you can still sum the expected values even if X and Y happen to be correlated.


Linearity of expectation is of course true in general, but in this case summing the expectations of the individual transition probabilities is only valid because the only valid state transitions are going from state i -> i+1 (owning i unique toys to i+1 unique toys) or just staying in state i. If you could magically go from state i to any other state (you lose unique toys or gain more than one unique toy), then this simple calculation would not be correct.


Doesn't it depend on the number/distribution of coupons printed/toys manufactured though? To take an extreme example, if there are 4 different toys and only 10 of each are manufactured, then the probability of getting 4 all the same is surely somewhat lower than it would be if you assume an essentially infinite toy supply where the chance of getting a particular toy is always 1/4.


We're effectively assuming that the number of toys is infinite and that all toys are equally likely.

Or, to put it slightly more formally, we're choosing each time from 4 equally likely possibilities, with replacement. Imagine you're drawing from a bag with 4 toys inside. How long until you've drawn each of the 4 toys, if you replace the toy in the bag after each choice?


Yes I gathered, but in reality there's not an infinite number all produced in equal quantities. I'm curious how much it might change how many meals you'd need to buy if it was a limited production run, and even whether the total number of participants would make much difference (clearly it must make some - if only 1000 of each toy were produced and 1000 customers took part, surely very few would have been lucky enough to get all 4 toy types at all, as on average they could have only received 4).


> Yes I gathered, but in reality there's not an infinite number all produced in equal quantities.

Most of the McDonald's toys are made in equal quantities, and the quantity available dwarfs the quantity any one customer buys.

> clearly it must make some - if only 1000 of each toy were produced and 1000 customers took part, surely very few would have been lucky enough to get all 4 toy types at all

I don't think the odds change much in this case. Buying 4 gives you only a very small chance even drawing from a bottomless pit -- a 3/4 * 1/2 * 1/4 = .09375 chance of getting all 4 with 4 purchases.

It's slightly better if there's a small number, say, 4 of each, because you remove the ones you don't want from contention.

You have a 12/15 chance of the second not matching (3 toys remain that you don't want; and 4 each remain of the 3 kinds you want); then an 8/14 chance of the third not matching, and then a 4/13 chance of the fourth one not matching -- or a ~14% chance. So even when we say there's only 4 of each toy, the odds don't change -that- much.

Once we get to 1000 of each, the odds basically are the same as the infinite case. 3000/3999, 2000/3998, 1000/3997 are practically 3/4, 1/2, 1/4 -- ~.09389.

> as on average they could have only received 4

Well, we already know from the infinite case you need to buy more than 4 to have a good shot.
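The without-replacement arithmetic above generalizes to any stock size (a sketch, assuming 4 kinds with equal copies of each):

```python
from fractions import Fraction

def p_four_kinds_in_four_draws(copies):
    # Draw 4 toys without replacement from a stock of `copies` of each of
    # 4 kinds; probability that all 4 kinds appear.
    total = 4 * copies
    p = Fraction(1)
    for drawn in range(1, 4):
        wanted = (4 - drawn) * copies  # copies left of kinds not yet drawn
        p *= Fraction(wanted, total - drawn)
    return p

print(float(p_four_kinds_in_four_draws(4)))     # ~0.141
print(float(p_four_kinds_in_four_draws(1000)))  # ~0.0939, near the 3/32 infinite case
```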


On the first problem, if you’re already going to estimate the answer with finite iterations over an infinite series, I think you should just skip ahead to the simulation that was used to test it.

The header for the test section says “Validation” but that’s not what’s happening. It’s not taking some answer and seeing if it’s correct. It’s coming to the answer in a different way, in a way that’s faster to write, faster to run and easier to understand than the other code.

Easier to extend and modify, too. Uneven weights for the toys? Easy. Want the P99 case? The standard deviation? Easy. Simulate it ten million times and see what comes out.
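As a sketch of that flexibility (the weights here are made up for illustration):

```python
import random

def meals_needed(weights, rng):
    # Buy until every toy index has been seen; toys drawn with given weights.
    seen = set()
    meals = 0
    while len(seen) < len(weights):
        seen.add(rng.choices(range(len(weights)), weights=weights)[0])
        meals += 1
    return meals

rng = random.Random(1)
# Hypothetical uneven print run: the last toy is half as common as the others.
samples = sorted(meals_needed([2, 2, 2, 1], rng) for _ in range(50_000))
mean = sum(samples) / len(samples)
p99 = samples[int(0.99 * len(samples)) - 1]
print(mean, p99)  # the mean creeps up from 8.33 once the weights are uneven
```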

I’ve just been hyped on simulations over calculations for probabilities lately and this is wonderful confirmation bias.


Yeah I think simulations are massively underused.

They might not help you prepare for a black swan type event, but a lot of the time we're not ready for very normal problems. I suspect a day or two per year spent doing basic modeling and simulation would benefit a lot of teams, even back-of-the-envelope stuff. Then you can return to the exercise a year later and see if your assumptions were on point or what turned out to be much different than expected, and refine the model and parameters of the simulation.


They're actually great for preparing for black swan events. Taleb talks about them in "The Black Swan".

Usually the rules of the game aren't as clearly stated as in this problem. Sometimes there is a literal coupon contest, or a node randomly pairs with another node. But usually you're looking at history to try to understand the odds and then forecast some futures. The problem with doing that with statistics is it's not very receptive to "fiddling", even if it's Bayesian. With a simulation it's generally easier to throw something weird in there, like an effect orders of magnitude larger, dynamic effects, or a totally new rule.


> They might not help you prepare for a black swan type event

at some point probabilistic simulations were all the rage in the financial industry and then the financial crisis happened

it is a fine art to figure out how to elicit the benefits of Monte Carlo simulation to put some structure around the future without falling into a mental trap that blinds you to non-quantifiable scenarios


Depending on how much precision you'd like for your answer, Monte Carlo methods can converge unbearably slowly, since the standard deviation is proportional to 1/sqrt(N) as you add more samples. I recall once having to run 100 billion simulations to get the mean of a certain quantity to within a few decimal places. I'd take an infinite series over that any day of the week (well, as long as it's not something as slow as the Leibniz series for pi/4).

Of course, such precision relative to the standard deviation is rarely necessary for most practical situations.
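A small sketch of that 1/sqrt(N) convergence, reusing the toy problem (the seeds and sample sizes are arbitrary):

```python
import random
import statistics

def meals_to_collect(rng):
    # One coupon-collector trial over 4 equally likely toys.
    seen, meals = set(), 0
    while len(seen) < 4:
        seen.add(rng.randrange(4))
        meals += 1
    return meals

def standard_error(n_samples, rng):
    xs = [meals_to_collect(rng) for _ in range(n_samples)]
    return statistics.stdev(xs) / n_samples ** 0.5

rng = random.Random(2)
small, big = standard_error(1_000, rng), standard_error(16_000, rng)
print(small, big)  # 16x the samples buys roughly a 4x smaller standard error
```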


As a friend in grad school used to say... "An infinitely long Monte Carlo simulation recapitulates the underlying probability distribution perfectly, but we don't want to wait that long."


Yeah, it’s the practicality. Most business decisions don’t need more than three sig figs [source?]. We only tend to think about precision on things that already have it.


Unless you are trying to prove the existence of a subatomic particle, I am not sure what business decisions have anything close to that level of certainty.


Kinda depends on what you mean by "business decisions". Repetition can bring in a high need for confidence. You don't want to have your life depend on a piece of equipment that has a one in a million chance of failure if it gets tested a thousand times an hour. Pricing things can also need a lot of certainty, like in HFT. But most things aren't either of those cases and people are commonly off by an order of magnitude.


"The Unreasonable Effectiveness of Monte Carlo Methods"


"The Unreasonable Effectiveness of reading a statistics textbook." There is enough material for, like, 100 blogposts or maybe even more!


Just making a simulation and codifying assumptions is hugely helpful. Forcing SMEs to rationalize a number and its spread has so much more utility than a closed form calculation which people have to take on faith.


The unreasonable effectiveness of articles having the phrase "Unreasonable Effectiveness" in their title on HN: https://hn.algolia.com/?q=unreasonable+effectiveness


It drives me crazy seeing articles with this title everywhere. Glad I'm not alone.


It's ok when it's merited, but often it's not, this article being one case in point.


Conditional probabilities are reasonably effective because they are the mathematical underpinning of probabilities.

They are not unreasonably effective.

It's just that sampling from probability distributions is computationally expensive, which has led to the development of frequentist statistics (as opposed to Bayesian statistics). Frequentist statistics is unreasonably effective, since it makes so many assumptions that it's hard to do a frequentist statistical analysis without violating some of the assumptions the methods have with regards to the data distributions, etc.


Let E(n) be the expected number of meals to collect at least one each of n toys if there are n possible toys available. We want to find E(4).

Note that

  E(4) = 1 + 4/3 E(3)
That's because after the first meal, there are now 3 toys we need to collect. If there were only 3 toys available we would expect to take E(3) meals to get them all. But since there are 4 available, 1/4 of the time we get another one of the toys we got on the first meal, so only 3/4 of the meals are productive toward our goal of collecting the remaining 3 toys, so it takes 4/3 as long.

Similarly,

  E(3) = 1 + 3/2 E(2)
  E(2) = 1 + 2/1 E(1)
If there were only 1 toy available it would take exactly 1 meal to collect it, so

  E(1) = 1
  E(2) = 1 + 2/1 1 = 3
  E(3) = 1 + 3/2 3 = 11/2 = 5 1/2
  E(4) = 1 + 4/3 11/2 = 50/6 = 25/3 = 8 1/3
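The recursion above translates directly to code (exact fractions, so 25/3 comes out exactly):

```python
from fractions import Fraction

def expected_meals(n):
    # E(1) = 1; E(k) = 1 + k/(k-1) * E(k-1)
    e = Fraction(1)
    for k in range(2, n + 1):
        e = 1 + Fraction(k, k - 1) * e
    return e

print(expected_meals(2), expected_meals(3), expected_meals(4))  # 3 11/2 25/3
```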


Published 2023-03-18. Futuristic :-)


The unreasonable use of the word unreasonable in the title


The title is a meme like “X considered harmful”


I don't think this problem needed conditional probabilities. I had a go at coming up with a combinatorics approach and a simulation approach to verify the first attempt. It's here if you're interested: https://medium.com/@benjamin22-314/get-all-four-toys-b8653c2...


Pet peeve. Just exactly what about it is unreasonable?


It's a reference https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...

It's interesting that we can start with some stuff, get mathematical conclusions, and then the mathematical conclusions lead to results about Physics. That's crazy! Well, not really, but it's cool, right?

But it's mostly used as a humorous snowclone here. So yeah, it is what it is, and you must remain peeved. There is no relief.


I think physics starts with some very basic assumptions about the world because that's what the world looks like. So I don't find it surprising that the models physics starts with, and their consequences would already have been investigated by mathematicians.

What is a physics problem if not a math-problem? So I never really understood the point of mathematics being unreasonably effective. What am I missing?


I know the reference. Hence the pet peeve.


For "Game of Stopped Dice" the article states "we would keep our winnings on the first throw only if they are 5 or 6". Would it not be better to cash out even at $4 and play again using those winnings?


Only if you were playing with your last four dollars in the world. But in that case you might not want to risk it anyway. Assuming you have more money to risk, you're better off with an expected $4.25 than a sure $4.


They may have realised they're playing a bad bet and won't let you play again.

I kid of course, but the implicit aim of these exercises is always to maximise expected value. The underlying assumption is that you can make a lot of independent bets and the results will average out thanks to the law of large numbers. Suppose you were playing this game 1000 times in parallel. Would you rather average zero profit at the end or $667?


The first problem is much simpler framed around Bernoulli sequences, each concluding when one gets another distinct toy. Then, it's just the expectation of the sum of four geometric random variables.

If you have zero toys, the probability of getting a new distinct toy is 1. With 1, it's 3/4. With 2, it's 1/2. With 3, it's 1/4.

Then, expected number of meals to get all the toys is 1/1 + 1/(3/4) + 1/(1/2) + 1/(1/4)
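In code, that's just summing the four geometric means (a one-liner sketch with exact fractions):

```python
from fractions import Fraction

# The mean of a geometric rv with success probability p is 1/p; sum over the
# four phases (0, 1, 2, 3 toys already owned).
total = sum(1 / Fraction(4 - owned, 4) for owned in range(4))
print(total)  # 25/3
```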


> The unreasonable effectiveness of conditional probabilities

Yup.

There is a more general result that is more amazing with still more "unreasonable effectiveness":

For events A and B, we can consider the conditional probability of A given B:

P(A|B) = P(A and B)/P(B)

Then we can generalize P(A|B):

Suppose we do an experiment and make two observations, X and Y. We regard each of X and Y as a random variable that takes on real number values.

We assume that X and Y have averages, that is, expectations, and write those as E[X] for the expectation of X and E[Y] for the expectation of Y.

We want to use X to approximate Y.

So, we want a real valued function f of one real variable x and use f(X) to approximate Y.

For the most accurate approximation, we want to minimize the expectation

E[(Y - f(X))^2]

Claim: For f(X) we want

f(X) = E[Y|X]

where E[Y|X] is the conditional expectation of Y given X and a generalization of P(A|B).

Proof:

We start by using one of the properties of conditional expectation and then continue with just simple algebra:

E[(Y - f(X))^2]

= E[ E[(Y - f(X))^2|X] ]

= E[ E[Y^2 - 2Yf(X) + f(X)^2|X] ]

= E[ E[Y^2|X] - 2f(X)E[Y|X] + f(X)^2 ]

= E[ E[Y^2|X] + E[Y|X]^2 - 2f(X)E[Y|X] + f(X)^2 - E[Y|X]^2 ]

= E[ E[Y^2|X] + ( E[Y|X] - f(X) )^2 - E[Y|X]^2 ]

which is minimized with

f(X) = E[Y|X]

Done.

So, the "unreasonable effectiveness" of conditional expectation is that it provides the best non-linear least squares approximation.

In terms of statistics E[Y|X] is the minimum variance estimator of Y and is also unbiased.

Yes, P(A|B), conditional probability, can be seen as a special case of E[Y|X], conditional expectation.

The detailed definition of E[Y|X] is based on the Radon-Nikodym theorem. W. Rudin's Real and Complex Analysis gives the novel von Neumann proof. Then in Neveu, Mathematical Foundations of the Calculus of Probability is elegant coverage of E[Y|X].
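The minimization claim is easy to check numerically on a tiny discrete joint distribution (the pmf below is made up for illustration):

```python
# Joint pmf over (x, y) pairs; values are arbitrary, probabilities sum to 1.
pmf = {(0, 1): 0.2, (0, 3): 0.3, (1, 2): 0.1, (1, 6): 0.4}

def cond_mean(x):
    # E[Y | X = x] = E[Y 1_{X=x}] / P(X = x)
    px = sum(p for (xi, _), p in pmf.items() if xi == x)
    return sum(y * p for (xi, y), p in pmf.items() if xi == x) / px

def mse(g):
    # E[(Y - g(X))^2] for a per-x table of guesses g.
    return sum(p * (y - g[x]) ** 2 for (x, y), p in pmf.items())

best = {x: cond_mean(x) for x in (0, 1)}
# No alternative per-x guess should beat the conditional mean.
for alt in ({0: 2.0, 1: 5.0}, {0: 2.2, 1: 6.0}, {0: 0.0, 1: 5.2}):
    assert mse(best) <= mse(alt)
print(best, round(mse(best), 4))
```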


> For the most accurate approximation, we want to minimize the expectation

> E[(Y - f(X))^2]

I don't think it's obvious why you would want that. It looks like you smuggle in your conclusion with that assumption.

I would start with establishing that the space of random variables with finite second moments is a Hilbert space with the inner product of X and Y defined as E[XY], which lends itself to a nice geometrical interpretation. Then we can project Y onto X, hence Y = Y_∥ + Y_⊥ where Y_∥ is an orthogonal projection and Y_⊥ is the leftover part that is orthogonal to X, ie X can perfectly predict Y_∥ but cannot say anything about Y_⊥. So Y_∥ by construction contains all the information we can have about Y if we know X. We can also note that the orthogonal projection minimizes distance, i.e. it minimizes E[(Y - f(X))^2].


> I don't think it's obvious why you would want that.

To minimize

E[(Y - f(X))^2]

is just common, traditional, nearly universal least squares with all the powers, advantages, and attributes there unto pertaining!

Yes, can also connect the least squares with Hilbert space geometry, but for that you add an assumption not needed with my approach via the Radon-Nikodym result.


> To minimize E[(Y - f(X))^2] is just common, traditional, nearly universal least squares with all the powers, advantages, and attributes there unto pertaining

For me that's hardly an intuitive notion. For example, in some econometrics textbooks you can find the reversed argument: OLS is motivated as a way to estimate the conditional expectation function. So one can ask, isn't OLS "just common, traditional, nearly universal least squares with all the powers, advantages, and attributes there unto pertaining" simply because it is a way to estimate conditional expectation? And now you are back to square one.

> you add an assumption not needed with my approach via the Radon-Nikodym result

Existence of a unique minimizer of E[(Y - f(X))^2] is a property of L2. You may carry over many properties to L1 by continuity, but it isn't one of them. If you don't assume existence of second moments, it is non-trivial to show that min E[(Y - f(X))^2] makes sense at all. I think that you have to put specific constraints on X and Y, but I don't know which are the least restrictive.


> If you don't assume existence of second moments, it is non-trivial to show that min E[(Y - f(X))^2] makes sense at all.

To show that

E[Y|X]

exists, essentially I did that.

Really, all we need from X is its sigma-algebra. That is, we can do the argument conditioning on a sigma-algebra instead of a random variable X. For Y, I didn't include the details back to the Radon-Nikodym theorem, but we may not even need that E[Y] is finite.

Sure don't need finite "second moments".


You assumed that E[Y|X] and E[Y^2|X] exist.


E[Y^2|X] exists?

Doesn't ask for much, merely that Y be measurable, that is, a random variable. We get to f'get about Riemann integration of probability densities.

Full details, with some of the most elegant, powerful, and precise work in the history of civilization, are in the references I gave, to Rudin and Neveu.


Uhm. Not every random variable has the second moment. And wasn’t that your objection towards the formulation in terms of Hilbert space? I am a bit lost trying to understand what you are trying to argue.


For a real random variable Y, E[Y^2] always exists although it might be infinite.


Moments and Hilbert space need have nothing to do with real valued function f of real variable x minimizing

     E[ (Y - f(X))^2 ]
for real valued random variables X and Y. For Y it is enough that it has an expectation. For X, it need only be measurable, that is, a random variable.

I referenced the Radon-Nikodym theorem. Assuming that, the definition and basic properties of

     E[Y|X]
are immediate.

Let's back up to something simpler: What real number a minimizes

     E[ (Y - a)^2 ]
     = E[ Y^2 - 2aY + a^2 ]
     = E[Y^2] - E[2aY] + E[a^2]
     = E[Y^2] - 2aE[Y] + E[a^2]
     = E[Y^2] + E[Y]^2 - 2aE[Y]
       - E[Y]^2 + E[a^2]
     = E[Y^2] + E[Y]^2 - 2aE[Y] - E[Y]^2 + a^2
     = E[Y^2] + (E[Y] - a)^2 - E[Y]^2
which is minimized by

     a = E[Y]
So, if real random variable

     X = a
that is, takes on only one value, then

     E[ (Y - a)^2 ]
is minimized by

     a = E[Y]
To improve on that case, suppose real random variable X is discrete, takes on finitely many values

     a_i
for positive integer n and

     i = 1, 2, ..., n
and define the indicator random variables

                    / 1 if X = a_i
     1_{X = a_i} =  |
                    \ 0 otherwise
Then

     X = sum_{i = 1}^n a_i 1_{X = a_i}
Let

     b_i = E[Y 1_{X = a_i}] / P(X = a_i)
Then by what we did above for

     a = E[Y]
minimizing

     E[ (Y - a)^2 ]
we have that

     f(X) = sum_{i = 1}^n b_i 1_{X = a_i}
minimizes

     E[(Y - f(X))^2]
And we would have

     E[Y|X] = f(X)
Basically we are doing classic cross tabulation: We collect data on both X and Y. When X = a_i, we average the corresponding Y values and get b_i for our estimate of Y.
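That cross-tabulation view is easy to sketch: simulate (X, Y) pairs, average Y within each X bucket, and the bucket means land on E[Y | X = a_i]. (The relationship Y = 2X + noise below is made up for illustration.)

```python
import random
from collections import defaultdict

rng = random.Random(4)
buckets = defaultdict(list)
for _ in range(60_000):
    x = rng.randrange(1, 7)                      # X: a die roll
    buckets[x].append(2 * x + rng.gauss(0, 1))   # Y = 2X + Gaussian noise

# b_i = average of the Y values observed alongside X = a_i
b = {x: sum(ys) / len(ys) for x, ys in sorted(buckets.items())}
print(b)  # each b_i lands near 2 * a_i
```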

The Radon-Nikodym result, among other things, lets us do the same when X is not discrete, that is, is any random variable. We don't even need for X to have an expectation -- all we want is measurability so that from X we have a sigma algebra.

The Radon-Nikodym result is part of measure theory, with a novel von Neumann proof in Rudin. The connection with probability and conditional expectation is part of graduate probability as in Neveu.

My all-time favorite course in school was the graduate probability course I took from Neveu, etc.

I never wanted to be a college professor, but for some unusual reasons for some years I was. Now I'm happy that I'm no longer a college professor!

I've published some peer-reviewed papers of my original research in pure/applied math. I don't like publishing and don't want to publish. In particular, I don't want to write a lecture in graduate probability here on Hacker News.

But you can see above that I made no use of moments.

This thread started with an observation that conditional probability is amazing. Then, easy enough, I mentioned that conditional probability is a special case of conditional expectation which has an amazing minimization property.

In addition, conditional expectation is the core of both Markov processes and martingales. The martingale convergence theorem is astounding.

My real interest now is my Internet startup -- back to it.


I am not sure what you mean by moments, but in probability theory “the n-th moment” means E[X^n]; and you’ve clearly used them.

In probability theory the Radon-Nikodym theorem is used to justify the existence of random variables. But it cannot be used to show that any variable has an expected value (ie has a moment, ie is integrable). The Cauchy distribution is a traditional counterexample to such an assertion.

> For Y it is enough that it has an expectation

That is false. Just look at your “simple example”. You arrive at E[Y^2]. An expectation is not enough. It should be clear intuitively, because MSE is a minimization of the second moment of a certain rv, so you are bound to have at least some constraints on second-order relationships for the variables you’ve used to construct that rv.

I think you may need to re-read Rudin and Neveu and refresh your knowledge. (Though I can say nothing about Neveu, I studied using the textbook by Shao.)


For any real valued random variable,

     E[Y^2]
always exists although it might be infinite.

You mentioned moments -- as I recall some writing in statistics uses that word, but I never have.

Your description of the Radon-Nikodym (R-N) result does not look right. Again, I don't want to write a lecture in graduate probability, but from memory the usual description of that result goes:

Given a real valued, integrable function f and a sigma algebra s, there exists a real valued function g, measurable with respect to s, so that g integrates like f over all sets in s. So, with real valued random variable Y in place of function f, we let

     E[Y|s]
be the function g. If s is the sigma algebra generated by real random variable X, then we write

     g(X) = E[Y|X] = E[Y|s]
So the R-N result makes immediate the definition of E[Y|X] and shows that it exists.

If you want a perfect, polished presentation, see the two references I gave, Rudin and Neveu.

My main contribution here remains:

     f(X) = E[Y|X]
minimizes

     E[ (Y - f(X))^2 ]
and is a minimum variance, unbiased estimator of Y. That's a generalization of the start of this thread and a bit amazing.


> Given real valued, integrable function f

Of course that makes it possible. Note that it doesn’t say that f is simply measurable. By definition being integrable means E[|f|] < infinity. (And being square-integrable means E[f^2] < infinity, an even stronger condition).

Consider Y = Z/X where Z, X iid ~ N(0,1). Try to calculate E[Y|X], and try to minimize E[ (Y - f(X))^2 ]. Neither will make sense. I am sure both Rudin and Neveu were aware of this trivial result.
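For what it's worth, the pathology is visible in simulation: Z/X with Z, X iid N(0,1) is standard Cauchy, so its running sample mean never settles down the way it would for a variable with a finite expectation. A quick sketch (plain Monte Carlo, nothing from the article):

```python
import random

random.seed(0)

def running_means(n):
    """Running sample mean of Y = Z/X with Z, X iid N(0,1) -- standard Cauchy."""
    total = 0.0
    out = []
    for i in range(1, n + 1):
        total += random.gauss(0, 1) / random.gauss(0, 1)
        out.append(total / i)
    return out

means = running_means(100_000)
# Occasional enormous draws keep jerking the running mean around;
# it does not converge to any value as n grows.
print(means[999], means[9_999], means[99_999])
```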


> Of course that makes it possible. Note that it doesn’t say that f is simply measurable.

Irrelevant.

Again, once again, over again, yet again, for real valued random variable Y,

     E[Y^2] 
always exists.

> By definition being integrable means E[|f|] < infinity.

Wrong.

Correcting you is a waste of my time.


You are not correcting me, you are correcting Rudin :)


Alternatively, my take is that while a function takes a thing from the domain and spits out a thing from the codomain, a conditional probability takes a thing from the domain and spits out a distribution over the things of the codomain: P(c|d) with d in the domain D and c in the codomain C. With finite C and D that suffices. If the distributions are one-hot, you get back an ordinary function.
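Concretely, for finite D and C this is just a row-stochastic matrix: row d holds the distribution P(c|d), and if every row is one-hot the matrix is the graph of an ordinary function. A tiny illustration (names are mine):

```python
# Conditional probability P(c|d) as a row-stochastic "matrix":
# rows indexed by the domain D, columns by the codomain C.
P = {
    "d1": {"c1": 0.7, "c2": 0.3},  # d1 maps to a genuine distribution
    "d2": {"c1": 0.0, "c2": 1.0},  # d2's row is one-hot: effectively a function value
}

for d, row in P.items():
    assert abs(sum(row.values()) - 1.0) < 1e-12  # each row sums to 1

def as_function(P):
    """If every row is one-hot, recover the ordinary function D -> C; else None."""
    f = {}
    for d, row in P.items():
        hot = [c for c, p in row.items() if p == 1.0]
        if len(hot) != 1:
            return None  # this row is a nondegenerate distribution
        f[d] = hot[0]
    return f

print(as_function(P))  # None here, since d1's row is not one-hot
```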


For the collectible toy problem, how much money would you save if you paired up with someone else and purchased meals until you both had all 4 toys? My intuition says going from 1 collector to 2 collectors would help significantly (like queuing theory).


Hm. I'm sad nobody answered this. I'm also on my phone and in a bad position to give an exact answer. However, rolling a few scenarios[1] and counting manually, I get 16 meals on average, with a standard error of 5. So not conclusive by any stretch, but it seems like it's roughly as expensive to be two people as it is to be one person.

[1]: https://www.random.org/integers/?num=24&min=1&max=4&col=4&ba...


Oops, miscalculation -- mean 14, standard error around 1. So at this point I can fairly confidently say it helps to pool the effort of two people, though maybe by less than one would expect coming from queueing theory.
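A larger Monte Carlo run backs this up. Here is a sketch under one reading of the pairing (my assumption: the pair pools every toy and stops once they hold two copies of each of the 4 types, i.e. one full set per person):

```python
import random

random.seed(42)

def meals_for_pair(n_toys=4, copies_needed=2):
    """Draws until the pooled collection has `copies_needed` of each toy."""
    counts = [0] * n_toys
    meals = 0
    while min(counts) < copies_needed:
        counts[random.randrange(n_toys)] += 1
        meals += 1
    return meals

trials = 50_000
est = sum(meals_for_pair() for _ in range(trials)) / trials
print(est)  # should land near 14.2, vs 2 * 8.33 ~= 16.7 for two solo collectors
```

By my calculation the exact expectation for this double-collection variant (via the Newman-Shepp integral) is about 14.19, consistent with the hand-rolled estimate of 14 above: pooling saves roughly 2.5 meals over two solo collectors.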


I wrote a blog about the general case https://nickp.svbtle.com/general-collectors-problem


Other commenters made other comments, but I also want to note that this article is terrible for another reason. For the first problem, they worked mathematically all the way up to the recursive expression, then switched to Python to compute the value. That is bad because writing the Python code to calculate ~8.3564 was easy in the first place; the preliminary mathematical analysis wasn't even necessary for it. To be instructional, they could have computed the 8.333 solution mathematically (as e.g. dgacmu or vecter did in this thread) and only later moved to Python for an alternative solution. As it stands, it is not only confusing but also incomplete for both approaches.
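For reference, the exact answer (8 1/3, as other commenters derived) falls out of one line of Python: with i distinct toys in hand, the chance of a new one is (4-i)/4, so the expected wait for it is 4/(4-i), and these waits sum by linearity of expectation.

```python
from fractions import Fraction

# Expected draws to collect all 4 toys: sum of 1/p over the four stages,
# where p = 4/4, 3/4, 2/4, 1/4 is the chance of drawing something new.
expected = sum(Fraction(4, 4 - i) for i in range(4))
print(expected, float(expected))  # 25/3, i.e. ~8.3333
```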


The first problem is known as "Coupon collector's problem".


> published 2023-03-18

straight from the future?


While I wouldn't dream of posting ChatGPT content, here are some questions you could give to ChatGPT to make this all a fair bit clearer:

First, what is an expectation value, as defined in the field of statistics?

Second, is there some reason an expectation value wouldn't behave like a strange attractor from dynamical systems theory and sort of wander about, rather than settle down to a precise value?

Third, what's the connection (if any) between conditional probability and expectation values?

For the first example, a more interesting real-world problem would be one where the total number of different outcomes (toys) is unknown - say, counting the number of species of beetle that exist on planet Earth by steadily sampling all the habitats one can find - how long would the tails be?

Variations on the latter examples might have odds that vary over time, in conjunction with costs of entry that also vary over time - what's the acceptable range of odds & costs to justify playing, with some expectation of at least coming out even? That's a bit more like real-world markets.


I think you were downvoted (not by me) because your questions, except for the one about strange attractors, are about basic topics in probability theory. So you should probably read a book or take a class. The one about strange attractors is more interesting and maybe not studied enough, but you could look at https://en.wikipedia.org/wiki/El_Farol_Bar_problem for a related example.


Why would chatGPT give you answers to these questions that you could trust to be correct and not give you a plausible-sounding but wrong answer and thereby a false sense of confidence?



