The unreasonable effectiveness of conditional probabilities (two-wrongs.com)
214 points by kqr on Feb 22, 2023 | hide | past | favorite | 78 comments



I find this a quite terrible explanation of a fairly simple concept, but perhaps I haven't had enough coffee. Here's an alternative try:

First, for the computer scientists out there, this is the Coupon Collector's problem - so you probably know already it's Θ(N log N). That doesn't give us a precise answer but it should guide our intuition - it should take somewhere like 4 * log(4) ~= 8 draws to get what we want. [1]

Being exact:

When you already have 3 of the 4 toys, what's the chance you get the 4th? It's 1 in 4. The expected number of draws to get an item with P=1/4 is 4.

Generalizing, the expected number of tries to get an item with probability p is 1/p. When you have 2 toys, you have a 1/2 probability of getting one of the two you haven't seen already, etc. When you have no toys, you always get something you don't have.

So the only slightly tricky case is when you have 1 toy, where you have a 3 in 4 chance of getting something new. 1/(3/4) = 1 1/3.

So the answer is very simply 4 + 2 + 1 1/3 + 1 = 8.3333.

More on the Coupon Collector's Problem: https://en.wikipedia.org/wiki/Coupon_collector%27s_problem

[1] The actual bounds use ln, not log2, but from a "getting intuition" perspective log2 gets us into the right range.
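Both the rough N log N estimate and the exact 8 1/3 can be sanity-checked with a quick simulation (a sketch; uniform draws over 4 equally likely toys assumed):

```python
import random

def meals_to_collect(n_toys, rng):
    # Buy meals until every one of the n_toys distinct toys has been seen.
    seen = set()
    meals = 0
    while len(seen) < n_toys:
        seen.add(rng.randrange(n_toys))
        meals += 1
    return meals

rng = random.Random(0)
trials = 100_000
mean = sum(meals_to_collect(4, rng) for _ in range(trials)) / trials
print(mean)  # close to 8.33
```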


To put it in a different way, construct a Markov chain with these states and transitions:

(A_0) -> (A_1) -> (A_2) -> (A_3) -> (A_4)

State A_i = you have i unique toys. Each state from 1-4 also has a transition to itself.

The probability of going from A_i -> A_{i+1} = P(A_{i,i+1}) = (1 - i/4). This is pretty obviously: 1, 3/4, 2/4, and 1/4. The transition probability of staying in the same state is 1 - that = i/4, although we don't use those values in our calculations.

The expected time to reach A_4 is E[A_0 -> A_1] + E[A_1 -> A_2] + E[A_2 -> A_3] + E[A_3 -> A_4] = 1/P(A_{0,1}) + 1/P(A_{1,2}) + 1/P(A_{2,3}) + 1/P(A_{3,4}) = 1 + 4/3 + 2 + 4 = 8 1/3. This is true due both to linearity of expectation and the fact that each state only either stays the same or advances. If any state could transition to another state besides itself or the subsequent one, you'd have to do a more complex calculation.
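The chain's expected absorption time can be computed exactly from these transitions; a small sketch with exact fractions:

```python
from fractions import Fraction

# E_i = expected meals from state A_i to absorption at A_4.
# E_4 = 0 and E_i = 1 + (i/4) E_i + (1 - i/4) E_{i+1},
# which rearranges to E_i = E_{i+1} + 1 / (1 - i/4).
E = Fraction(0)  # E_4
for i in range(3, -1, -1):
    E = E + 1 / Fraction(4 - i, 4)
print(E)  # 25/3
```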


When I started reading the article, I thought it was going to be an introduction to Markov Chains. But now that I’ve finished reading it, I guess we should expect an article named “The Unreasonable Effectiveness of Matrices” before that.


I agree - the OP’s explanation is terrible. What does this even mean:

> From a fundamentals perspective, the expected number of meals is a function of the probability of finding the fourth toy in any specific meal.

I can barely parse it. The probability of finding the nth toy before the nth meal is 0, so is the answer a function of 0? Or maybe the probability of finding the toy with index 4 in a specific meal is 1/4, and the answer is a function of 1/4?

Anyway, if you drink another coffee and stop reducing your fractions, you get a nicer way to say the answer :). After getting n different toys, the probability of a duplicate on the next try is n/4, so the probability of a new toy is 1-n/4 = (4-n)/4, and it takes 4/(4-n) draws in expectation to go from n to n+1 toys. So to go from 0 toys to 4 toys takes 4 * (1/4 + 1/3 + 1/2 + 1/1) tries in expectation, because you can just add the expected number of tries for each new toy.


>> From a fundamentals perspective, the expected number of meals is a function of the probability of finding the fourth toy in any specific meal.

> I can barely parse it. The probability of finding the nth toy before the nth meal is 0, so is the answer a function of 0?

The probability of finding the fourth toy in the first meal is 0%.

The probability of finding the fourth toy in the second meal is 0%.

The probability of finding the fourth toy in the third meal is 0%.

The probability of finding the fourth toy in the fourth meal is 9.4% or whatever.

The probability of finding the fourth toy in the fifth meal is 37.5% or whatever.

Etc.

The expected number of meals is 1 x 0% + 2 x 0% + 3 x 0% + 4 x 9.4% + 5 x 37.5% + 6 x ... + ...


More precisely:

The probability of finding the fourth toy in the first meal is 0%.

The probability of finding the fourth toy in the second meal is 0%.

The probability of finding the fourth toy in the third meal is 0%.

The probability of finding the fourth toy in the fourth meal is 3/32 = 9.4%.

The probability of finding the fourth toy in the fifth meal is 9/64 = 14.1%.

The probability of finding the fourth toy in the sixth meal is 75/512 = 14.6%.

The probability of finding the fourth toy in the seventh meal is 135/1024 = 13.2%.

The probability of finding the fourth toy in the eighth meal is 903/8192 = 11.0%.

The probability of finding the fourth toy in the ninth meal is 1449/16384 = 8.8%.

Etc.

The expected number of meals is 1 x 0% + 2 x 0% + 3 x 0% + 4 x 9.4% + 5 x 14.1% + 6 x 14.6% + 7 x 13.2% + 8 x 11.0% + 9 x 8.8% + ...
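These per-meal probabilities come from inclusion-exclusion over which toys are still missing, and the weighted sum recovers 8 1/3. A sketch in exact arithmetic, truncating the infinite series once the tail is negligible:

```python
from fractions import Fraction
from math import comb

def p_done_within(n, k=4):
    # P(all k toys collected within n meals), by inclusion-exclusion.
    return sum((-1) ** j * comb(k, j) * Fraction(k - j, k) ** n
               for j in range(k + 1))

pmf = {n: p_done_within(n) - p_done_within(n - 1) for n in range(1, 80)}
print(pmf[4], pmf[5])  # 3/32 9/64
mean = sum(n * p for n, p in pmf.items())
print(float(mean))     # close to 8.3333
```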


A way to see the scaling this commenter quoted:

Suppose there are on average N items left to be seen at time t. Then using a continuous time approximation:

dN/dt = -N / N_tot

This gives

-dN / N = dt / N_tot

Integrate the left side from N_tot down to a small value a, and the right side from 0 to t:

ln(N_tot / a) = t / N_tot

Solving for t gives t = N_tot * ln(N_tot / a), which is the N log N scaling quoted.


After 9 purchases, what's the percentage certainty you'd have all four toys? If you had to pre-purchase all n at once from the outset, what's the value of n to be 99.9% certain you'd get all four?


Estimates from 67 million simulations

After 9: 71.13%

p99.9: 29

Ran it up to 2 billion. Still holds, but no additional precision! It's somewhere around 71.134%, so it was threatening to round up for a bit.


Got it up to 17 billion and it flipped to 71.135%! I’m away from my computer and can’t remember more digits.

I’m really surprised by this. I knew things converged slowly, but I didn’t appreciate how much and for how long it would move here. I’d be very appreciative if anyone calculated the number.


I got 71.13647% on a first pass. So we want P(all 4 in 9) = 1 - P(not all 4), and we can split that out a few ways. To not get all four, we can restrict ourselves to three, so that's (3/4)^9, but there are four ways of doing that, so that's 4 * (3/4)^9. But that counts using singles and pairs too many times. Specifically each version of "three" can be exactly three balls, three ways of one ball, or three ways of exactly two balls ("1 or 2 or 3" = "1&2&3" or "just 1" or "just 2" or "just 3" or "1&2" or "1&3" or "2&3").

- We can then subtract 6 * P(two balls), so 6 * (2/4)^9. Now this counts singles a few times too, in fact it cancels all of them out.

- We then need to add back four singles, so 4 * (1/4)^9

Putting this together gives:

1 - (4 * (3/4)^9 - 6 * (2/4)^9 + 4 * (1/4)^9) = 0.7113647
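The same inclusion-exclusion, sketched in code; it also answers the earlier question of how many pre-purchases give 99.9% certainty:

```python
from math import comb

def p_all_four_within(n, k=4):
    # P(all k toys appear within n draws), by inclusion-exclusion.
    return sum((-1) ** j * comb(k, j) * ((k - j) / k) ** n
               for j in range(k + 1))

print(round(p_all_four_within(9), 7))  # 0.7113647

n = 1
while p_all_four_within(n) < 0.999:
    n += 1
print(n)                               # 29 meals for 99.9% certainty
```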


This took me way too long to understand. :)

But I get it now. If the four prizes are ABCD, then if you calculate your chance of only getting two balls in five purchases, you can do it by calculating your chances to get "A or B" five times in a row. But those include the AAAAA and BBBBB scenarios, which aren't two balls.

Repeat that for AC, AD, BC, BD, and CD.



Don't know if you're still reading this thread, but one thing I'm stumped on is that if I plug in 8.333 instead of 9, I get a probability of around 65%.

If I plug in 7, I still get a probability of above 50%, which seems to conflict with the 8.333 answer calculated upstream.

Wouldn't a positive EV imply that a 50% probability is the break-even?


The reply button is back so I can thank you publicly!


Thanks for doing that. I suspected those answers were very different than the 8.3, which I'm guessing must have been around the 50th percentile. The way the initial question was asked, it seemed to suggest you had a pretty high probability of all four after 8.3, which isn't true.


It hinges on the "expect" in "how much do you expect to pay?"

You can think of it like "expected value", where you're doing a comparison. Those meals are $8 each. If someone says they'll give you $80 for a set, this would be a profitable endeavor. On average, you'd expect to spend $66.67 (8.33*8) to get a collection.

If you think of it like "setting expectations", it's not so good. If a mafia boss told you he was expecting you to get all four toys for his daughter, you're not going to place an order for nine meals and walk in there with 71% confidence. I might show up with 59 of them, which is the p99.9999 value.


My initial hunch was also to calculate the answer as 1 + 4/3 + 2 + 4. Ie, "average tries to get 1st toy, plus average tries to get 2nd toy, plus ...."

However, my memory of probability is very fuzzy and I wasn't sure if my above intuition was mathematically rigorous. Ie, can the expected values of these 4 events be simply summed in order to get the combined expected value?

For anyone else curious about this, here's the mathematical proof that you can indeed do so. Ie, a proof that E(X+Y) = E(X) + E(Y), where X is defined as the number of tries to get the 1st toy, Y is the number of tries to get the 2nd toy, and so on: https://www.milefoot.com/math/stat/rv-sums.htm

In this specific example, X and Y are independent random variables. But if my reading of the above proof is correct, you can still sum the expected values even if X and Y happen to be correlated.


Linearity of expectation is of course true in general, but in this case summing the expectations of the individual transition probabilities is only valid because the only valid state transitions are going from state i -> i+1 (owning i unique toys to i+1 unique toys) or just staying in state i. If you could magically go from state i to any other state (you lose unique toys or gain more than one unique toy), then this simple calculation would not be correct.


Doesn't it depend on the number/distribution of coupons printed/toys manufactured though? To take an extreme example, if there are 4 different toys and only 10 of each are manufactured, then the probability of getting 4 all the same is surely somewhat lower than it would be if you assume an essentially infinite toy supply where the chance of getting a particular toy is always 1/4.


We're effectively assuming that the number of toys is infinite and that all toys are equally likely.

Or, to put it slightly more formally, we're choosing each time from 4 equally likely possibilities, with replacement. Imagine you're drawing from a bag with 4 toys inside. How long until you've drawn each of the 4 toys, if you replace the toy in the bag after each choice?


Yes I gathered, but in reality there's not an infinite number all produced in equal quantities. I'm curious how much it might change how many meals you'd need to buy if it was a limited production run, and even whether the total number of participants would make much difference (clearly it must make some - if only 1000 of each toy were produced and 1000 customers took part, surely very few would have been lucky enough to get all 4 toy types at all, as on average they could have only received 4).


> Yes I gathered, but in reality there's not an infinite number all produced in equal quantities.

Most of the McDonald's toys are made in equal quantities, and the quantity available dwarfs the quantity any one customer buys.

> clearly it must make some - if only 1000 of each toy were produced and 1000 customers took part, surely very few would have been lucky enough to get all 4 toy types at all

I don't think the odds change much in this case. Buying 4 gives you only a very small chance even drawing from a bottomless pit -- a 3/4 * 1/2 * 1/4 = .09375 chance of getting all 4 with 4 purchases.

It's slightly better if there's a small number, say, 4 of each, because you remove the ones you don't want from contention.

You have a 12/15 chance of the second not matching (3 toys remain that you don't want; and 4 each remain of the 3 kinds you want); then an 8/14 chance of the third not matching, and then a 4/13 chance of the fourth one not matching -- or a ~14% chance. So even when we say there's only 4 of each toy, the odds don't change -that- much.

Once we get to 1000 of each, the odds basically are the same as the infinite case. 3000/3999, 2000/3998, 1000/3997 are practically 3/4, 1/2, 1/4 -- ~.09389.

> as on average they could have only received 4

Well, we already know from the infinite case you need to buy more than 4 to have a good shot.
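The without-replacement arithmetic above generalizes to any stock size (a sketch, assuming 4 kinds with equal copies of each):

```python
from fractions import Fraction

def p_four_kinds_in_four_draws(copies):
    # Draw 4 toys without replacement from a stock of `copies` of each of
    # 4 kinds; probability that all 4 kinds appear.
    total = 4 * copies
    p = Fraction(1)
    for drawn in range(1, 4):
        wanted = (4 - drawn) * copies  # copies left of kinds not yet drawn
        p *= Fraction(wanted, total - drawn)
    return p

print(float(p_four_kinds_in_four_draws(4)))     # ~0.141
print(float(p_four_kinds_in_four_draws(1000)))  # ~0.0939, near the 3/32 infinite case
```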


On the first problem, if you’re already going to estimate the answer with finite iterations over an infinite series, I think you should just skip ahead to the simulation that was used to test it.

The header for the test section says “Validation” but that’s not what’s happening. It’s not taking some answer and seeing if it’s correct. It’s coming to the answer in a different way, in a way that’s faster to write, faster to run and easier to understand than the other code.

Easier to extend and modify, too. Uneven weights for the toys? Easy. Want the P99 case? The standard deviation? Easy. Simulate it ten million times and see what comes out.
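As a sketch of that flexibility (the weights here are made up for illustration):

```python
import random

def meals_needed(weights, rng):
    # Buy until every toy index has been seen; toys drawn with given weights.
    seen = set()
    meals = 0
    while len(seen) < len(weights):
        seen.add(rng.choices(range(len(weights)), weights=weights)[0])
        meals += 1
    return meals

rng = random.Random(1)
# Hypothetical uneven print run: the last toy is half as common as the others.
samples = sorted(meals_needed([2, 2, 2, 1], rng) for _ in range(50_000))
mean = sum(samples) / len(samples)
p99 = samples[int(0.99 * len(samples)) - 1]
print(mean, p99)  # the mean creeps up from 8.33 once the weights are uneven
```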

I’ve just been hyped on simulations over calculations for probabilities lately and this is wonderful confirmation bias.


Yeah I think simulations are massively underused.

They might not help you prepare for a black swan type event, but a lot of the time we're not ready for very normal problems. I suspect a day or two per year spent doing basic modeling and simulation would benefit a lot of teams, even back-of-the-envelope stuff. Then you can return to the exercise a year later and see if your assumptions were on point or what turned out to be much different than expected, and refine the model and parameters of the simulation.


They're actually great for preparing for black swan events. Taleb talks about them in "The Black Swan".

Usually the rules of the game aren't as clearly stated as in this problem. Sometimes there is a literal coupon contest, or a node randomly pairs with another node. But usually you're looking at history to try to understand the odds and then forecast some futures. The problem with doing that with statistics is it's not very receptive to "fiddling", even if it's Bayesian. With a simulation it's generally easier to throw something weird in there, like an effect orders of magnitude larger, dynamic effects, or a totally new rule.


> They might not help you prepare for a black swan type event

at some point probabilistic simulations were all the rage in the financial industry and then the financial crisis happened

it is a fine art to figure out how to elicit the benefits of Monte Carlo simulation to put some structure around the future without falling into a mental trap that blinds you to non-quantifiable scenarios


Depending on how much precision you'd like for your answer, Monte Carlo methods can converge unbearably slowly, since the standard deviation is proportional to 1/sqrt(N) as you add more samples. I recall once having to run 100 billion simulations to get the mean of a certain quantity to within a few decimal places. I'd take an infinite series over that any day of the week (well, as long as it's not something as slow as the Leibniz series for pi/4).

Of course, such precision relative to the standard deviation is rarely necessary for most practical situations.
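A small sketch of that 1/sqrt(N) convergence, reusing the toy problem (the seeds and sample sizes are arbitrary):

```python
import random
import statistics

def meals_to_collect(rng):
    # One coupon-collector trial over 4 equally likely toys.
    seen, meals = set(), 0
    while len(seen) < 4:
        seen.add(rng.randrange(4))
        meals += 1
    return meals

def standard_error(n_samples, rng):
    xs = [meals_to_collect(rng) for _ in range(n_samples)]
    return statistics.stdev(xs) / n_samples ** 0.5

rng = random.Random(2)
small, big = standard_error(1_000, rng), standard_error(16_000, rng)
print(small, big)  # 16x the samples buys roughly a 4x smaller standard error
```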


As a friend in grad school used to say... "An infinitely long Monte Carlo simulation recapitulates the underlying probability distribution perfectly, but we don't want to wait that long."


Yeah, it’s the practicality. Most business decisions don’t need more than three sig figs [source?]. We only tend to think about precision on things that already have it.


Unless you are trying to prove the existence of a subatomic particle, I am not sure what business decisions have anything close to that level of certainty.


Kinda depends on what you mean by "business decisions". Repetition can bring in a high need for confidence. You don't want to have your life depend on a piece of equipment that has a one in a million chance of failure if it gets tested a thousand times an hour. Pricing things can also need a lot of certainty, like in HFT. But most things aren't either of those cases and people are commonly off by an order of magnitude.


"The Unreasonable Effectiveness of Monte Carlo Methods"


"The Unreasonable Effectiveness of reading a statistics textbook." There is enough material for, like, 100 blogposts or maybe even more!


Just making a simulation and codifying assumptions is hugely helpful. Forcing SMEs to rationalize a number and its spread has so much more utility than a closed form calculation which people have to take on faith.


The unreasonable effectiveness of articles having the phrase "Unreasonable Effectiveness" in their title on HN: https://hn.algolia.com/?q=unreasonable+effectiveness


It drives me crazy seeing articles with this title everywhere. Glad I'm not alone.


It's ok when it's merited, but often it's not, this article being one case in point.


Conditional probabilities are reasonably effective because they are the mathematical underpinning of probabilities.

They are not unreasonably effective.

It's just that sampling from probability distributions is computationally expensive, which has led to the development of frequentist statistics (as opposed to Bayesian statistics). Frequentist statistics is unreasonably effective, since it makes so many assumptions that it's hard to do a frequentist statistical analysis without violating some of the assumptions the methods have with regards to the data distributions, etc.


Let E(n) be the expected number of meals to collect at least one each of n toys if there are n possible toys available. We want to find E(4).

Note that

  E(4) = 1 + 4/3 E(3)
That's because after the first meal, there are now 3 toys we need to collect. If there were only 3 toys available we would expect to take E(3) meals to get them all. But since there are 4 available, 1/4 of the time we get another one of the toys we got on the first meal, so only 3/4 of the meals are productive toward our goal of collecting the remaining 3 toys, so it takes 4/3 as long.

Similarly,

  E(3) = 1 + 3/2 E(2)
  E(2) = 1 + 2/1 E(1)
If there were only 1 toy available it would take exactly 1 meal to collect it, so

  E(1) = 1
  E(2) = 1 + 2/1 1 = 3
  E(3) = 1 + 3/2 3 = 11/2 = 5 1/2
  E(4) = 1 + 4/3 11/2 = 50/6 = 25/3 = 8 1/3
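The recursion above translates directly to code (exact fractions, so 25/3 comes out exactly):

```python
from fractions import Fraction

def expected_meals(n):
    # E(1) = 1; E(k) = 1 + k/(k-1) * E(k-1)
    e = Fraction(1)
    for k in range(2, n + 1):
        e = 1 + Fraction(k, k - 1) * e
    return e

print(expected_meals(2), expected_meals(3), expected_meals(4))  # 3 11/2 25/3
```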


Published 2023-03-18. Futuristic :-)


The unreasonable use of the word unreasonable in the title


The title is a meme like “X considered harmful”


I don't think this problem needed conditional probabilities. I had a go at coming up with a combinatorics approach and a simulation approach to verify the first attempt. It's here if you're interested: https://medium.com/@benjamin22-314/get-all-four-toys-b8653c2...


Pet peeve. Just exactly what about it is unreasonable?


It's a reference https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...

It's interesting that we can start with some stuff, get mathematical conclusions, and then the mathematical conclusions lead to results about Physics. That's crazy! Well, not really, but it's cool, right?

But it's mostly used as a humorous snowclone here. So yeah, it is what it is, and you must remain peeved. There is no relief.


I think physics starts with some very basic assumptions about the world because that's what the world looks like. So I don't find it surprising that the models physics starts with, and their consequences would already have been investigated by mathematicians.

What is a physics problem if not a math-problem? So I never really understood the point of mathematics being unreasonably effective. What am I missing?


I know the reference. Hence the pet peeve.


For "Game of Stopped Dice" the article states "we would keep our winnings on the first throw only if they are 5 or 6". Would it not be better to cash out even at $4 and play again using those winnings?


Only if you were playing with your last four dollars in the world. But in that case you might not want to risk it anyway. Assuming you have more money to risk, you're better off with an expected $4.25 than a sure $4.


They may have realised they're playing a bad bet and won't let you play again.

I kid of course, but the implicit aim of these exercises is always to maximise expected value. The underlying assumption is that you can make a lot of independent bets and the results will average out thanks to the law of large numbers. Suppose you were playing this game 1000 times in parallel. Would you rather average zero profit at the end or $667?


The first problem is much simpler framed around Bernoulli sequences, each concluding when one gets another distinct toy. Then, it's just the expectation of the sum of four geometric random variables.

If you have zero toys, the probability of getting a new distinct toy is 1. With 1, it's 3/4. With 2, it's 1/2. With 3, it's 1/4.

Then, expected number of meals to get all the toys is 1/1 + 1/(3/4) + 1/(1/2) + 1/(1/4)
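In code, that's just summing the four geometric means (a one-liner sketch with exact fractions):

```python
from fractions import Fraction

# The mean of a geometric rv with success probability p is 1/p; sum over the
# four phases (0, 1, 2, 3 toys already owned).
total = sum(1 / Fraction(4 - owned, 4) for owned in range(4))
print(total)  # 25/3
```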


> The unreasonable effectiveness of conditional probabilities

Yup.

There is a more general result that is more amazing with still more "unreasonable effectiveness":

For events A and B, we can consider the conditional probability of A given B:

P(A|B) = P(A and B)/P(B)

Then we can generalize P(A|B):

Suppose we do an experiment and make two observations, X and Y. We regard each of X and Y as a random variable that takes on real number values.

We assume that X and Y have averages, that is, expectations, and write those as E[X] for the expectation of X and E[Y] for the expectation of Y.

We want to use X to approximate Y.

So, we want a real valued function f of one real variable x and use f(X) to approximate Y.

For the most accurate approximation, we want to minimize the expectation

E[(Y - f(X))^2]

Claim: For f(X) we want

f(X) = E[Y|X]

where E[Y|X] is the conditional expectation of Y given X and a generalization of P(A|B).

Proof:

We start by using one of the properties of conditional expectation and then continue with just simple algebra:

E[(Y - f(X))^2]

= E[ E[(Y - f(X))^2|X] ]

= E[ E[Y^2 - 2Yf(X) + f(X)^2|X] ]

= E[ E[Y^2|X] - 2f(X)E[Y|X] + f(X)^2 ]

= E[ E[Y^2|X] + E[Y|X]^2 - 2f(X)E[Y|X] + f(X)^2 - E[Y|X]^2 ]

= E[ E[Y^2|X] + ( E[Y|X] - f(X) )^2 - E[Y|X]^2 ]

which is minimized with

f(X) = E[Y|X]

Done.

So, the "unreasonable effectiveness" of conditional expectation is that it provides the best non-linear least squares approximation.

In terms of statistics E[Y|X] is the minimum variance estimator of Y and is also unbiased.

Yes, P(A|B), conditional probability, can be seen as a special case of E[Y|X], conditional expectation.

The detailed definition of E[Y|X] is based on the Radon-Nikodym theorem. W. Rudin's Real and Complex Analysis gives the novel von Neumann proof. Then in Neveu, Mathematical Foundations of the Calculus of Probability is elegant coverage of E[Y|X].
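The minimization claim is easy to check numerically on a tiny discrete joint distribution (the pmf below is made up for illustration):

```python
# Joint pmf over (x, y) pairs; values are arbitrary, probabilities sum to 1.
pmf = {(0, 1): 0.2, (0, 3): 0.3, (1, 2): 0.1, (1, 6): 0.4}

def cond_mean(x):
    # E[Y | X = x] = E[Y 1_{X=x}] / P(X = x)
    px = sum(p for (xi, _), p in pmf.items() if xi == x)
    return sum(y * p for (xi, y), p in pmf.items() if xi == x) / px

def mse(g):
    # E[(Y - g(X))^2] for a per-x table of guesses g.
    return sum(p * (y - g[x]) ** 2 for (x, y), p in pmf.items())

best = {x: cond_mean(x) for x in (0, 1)}
# No alternative per-x guess should beat the conditional mean.
for alt in ({0: 2.0, 1: 5.0}, {0: 2.2, 1: 6.0}, {0: 0.0, 1: 5.2}):
    assert mse(best) <= mse(alt)
print(best, round(mse(best), 4))
```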


> For the most accurate approximation, we want to minimize the expectation

> E[(Y - f(X))^2]

I don't think it's obvious why you would want that. It looks like you smuggle in your conclusion with that assumption.

I would start with establishing that the space of random variables with finite second moments is a Hilbert space with the inner product of X and Y defined as E[XY], which lends itself to a nice geometrical interpretation. Then we can project Y onto X, hence Y = Y_∥ + Y_⊥ where Y_∥ is an orthogonal projection and Y_⊥ is the leftover part that is orthogonal to X, ie X can perfectly predict Y_∥ but cannot say anything about Y_⊥. So Y_∥ by construction contains all the information we can have about Y if we know X. We can also note that the orthogonal projection minimizes distance, i.e. it minimizes E[(Y - f(X))^2].


> I don't think it's obvious why you would want that.

To minimize

E[(Y - f(X))^2]

is just common, traditional, nearly universal least squares with all the powers, advantages, and attributes there unto pertaining!

Yes, can also connect the least squares with Hilbert space geometry, but for that you add an assumption not needed with my approach via the Radon-Nikodym result.


> To minimize E[(Y - f(X))^2] is just common, traditional, nearly universal least squares with all the powers, advantages, and attributes there unto pertaining

For me that's hardly an intuitive notion. For example, in some econometrics textbooks you can find the reversed argument: OLS is motivated as a way to estimate the conditional expectation function. So one can ask, isn't OLS "just common, traditional, nearly universal least squares with all the powers, advantages, and attributes there unto pertaining" simply because it is a way to estimate conditional expectation? And now you are back to square one.

> you add an assumption not needed with my approach via the Radon-Nikodym result

Existence of a unique minimizer of E[(Y - f(X))^2] is a property of L2. You may carry over many properties to L1 by continuity, but it isn't one of them. If you don't assume existence of second moments, it is non-trivial to show that min E[(Y - f(X))^2] makes sense at all. I think that you have to put specific constraints on X and Y, but I don't know which are the least restrictive.


> If you don't assume existence of second moments, it is non-trivial to show that min E[(Y - f(X))^2] makes sense at all.

To show that

E[Y|X]

exists, essentially I did that.

Really, all we need from X is its sigma-algebra. That is, we can do the argument conditioning on a sigma-algebra instead of a random variable X. For Y, I didn't include the details back to the Radon-Nikodym theorem, but we may not even need that E[Y] is finite.

Sure don't need finite "second moments".


You assumed that E[Y|X] and E[Y^2|X] exist.


E[Y^2|X] exists?

Doesn't ask for much, merely that Y be measurable, that is, a random variable. We get to f'get about Riemann integration of probability densities.

Full details, with some of the most elegant, powerful, and precise work in the history of civilization, are in the references I gave, to Rudin and Neveu.


Uhm. Not every random variable has the second moment. And wasn’t that your objection towards the formulation in terms of Hilbert space? I am a bit lost trying to understand what you are trying to argue.


For a real random variable Y, E[Y^2] always exists although it might be infinite.


Moments and Hilbert space need have nothing to do with real valued function f of real variable x minimizing

     E[ (Y - f(X))^2 ]
for real valued random variables X and Y. For Y it is enough that it has an expectation. For X, it need only be measurable, that is, a random variable.

I referenced the Radon-Nikodym theorem. Assuming that, the definition and basic properties of

     E[Y|X]
are immediate.

Let's back up to something simpler: What real number a minimizes

     E[ (Y - a)^2 ]
     = E[ Y^2 - 2aY + a^2 ]
     = E[Y^2] - E[2aY] + E[a^2]
     = E[Y^2] - 2aE[Y] + E[a^2]
     = E[Y^2] + E[Y]^2 - 2aE[Y]
       - E[Y]^2 + E[a^2]
     = E[Y^2] + E[Y]^2 - 2aE[Y] - E[Y]^2 + a^2
     = E[Y^2] + (E[Y] - a)^2 - E[Y]^2
which is minimized by

     a = E[Y]
So, if real random variable

     X = a
that is, takes on only one value, then

     E[ (Y - a)^2 ]
is minimized by

     a = E[Y]
To improve on that case, suppose real random variable X is discrete, takes on finitely many values

     a_i
for positive integer n and

     i = 1, 2, ..., n
and define the indicator random variables

                    / 1 if X = a_i
     1_{X = a_i} =  |
                    \ 0 otherwise
Then

     X = sum_{i = 1}^n a_i 1_{X = a_i}
Let

     b_i = E[Y 1_{X = a_i}] / P(X = a_i)
Then by what we did above for

     a = E[Y]
minimizing

     E[ (Y - a)^2 ]
we have that

     f(X) = sum_{i = 1}^n b_i 1_{X = a_i}
minimizes

     E[(Y - f(X))^2]
And we would have

     E[Y|X] = f(X)
Basically we are doing classic cross tabulation: We collect data on both X and Y. When X = a_i, we average the corresponding Y values and get b_i for our estimate of Y.
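That cross-tabulation view is easy to sketch: simulate (X, Y) pairs, average Y within each X bucket, and the bucket means land on E[Y | X = a_i]. (The relationship Y = 2X + noise below is made up for illustration.)

```python
import random
from collections import defaultdict

rng = random.Random(4)
buckets = defaultdict(list)
for _ in range(60_000):
    x = rng.randrange(1, 7)                      # X: a die roll
    buckets[x].append(2 * x + rng.gauss(0, 1))   # Y = 2X + Gaussian noise

# b_i = average of the Y values observed alongside X = a_i
b = {x: sum(ys) / len(ys) for x, ys in sorted(buckets.items())}
print(b)  # each b_i lands near 2 * a_i
```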

The Radon-Nikodym result, among other things, lets us do the same when X is not discrete, that is, is any random variable. We don't even need for X to have an expectation -- all we want is measurability so that from X we have a sigma algebra.

The Radon-Nikodym result is part of measure theory, with a novel von Neumann proof in Rudin. The connection with probability and conditional expectation is part of graduate probability as in Neveu.

My all-time favorite course in school was the graduate probability course I took from Neveu, etc.

I never wanted to be a college professor, but for some unusual reasons for some years I was. Now I'm happy that I'm no longer a college professor!

I've published some peer-reviewed papers of my original research in pure/applied math. I don't like publishing and don't want to publish. In particular, I don't want to write a lecture in graduate probability here on Hacker News.

But you can see above that I made no use of moments.

This thread started with an observation that conditional probability is amazing. Then, easy enough, I mentioned that conditional probability is a special case of conditional expectation which has an amazing minimization property.

In addition, conditional expectation is the core of both Markov processes and martingales. The martingale convergence theorem is astounding.

My real interest now is my Internet startup -- back to it.


I am not sure what you mean by moments, but in probability theory “the n-th moment” means E[X^n]; and you’ve clearly used them.

In probability theory the Radon-Nikodym theorem is used to justify the existence of random variables. But it cannot be used to show that any variable has an expected value (ie has a moment, ie is integrable). The Cauchy distribution is a traditional counterexample to such an assertion.

> For Y it is enough that it has an expectation

That is false. Just look at your “simple example”. You arrive at E[Y^2]. An expectation is not enough. It should be clear intuitively, because MSE is a minimization of the second moment of a certain rv, so you are bound to have at least some constraints on second-order relationships for the variables you’ve used to construct that rv.

I think you may need to re-read Rudin and Neveu and refresh your knowledge. (Though I can say nothing about Neveu, I studied using the textbook by Shao.)


For any real valued random variable,

     E[Y^2]
always exists although it might be infinite.

You mentioned moments -- as I recall some writing in statistics uses that word, but I never have.

Your description of the Radon-Nikodym (R-N) result does not look right. Again, I don't want to write a lecture in graduate probability, but from memory the usual description of that result goes:

Given a real valued, integrable function f and a sigma algebra s, there exists a real valued function g, measurable with respect to s, so that g integrates like f over all sets in s. So, with real valued random variable Y in place of function f, we let

     E[Y|s]
be the function g. If s is the sigma algebra generated by real random variable X, then we write

     g(X) = E[Y|X] = E[Y|s]
So the R-N result makes immediate the definition of E[Y|X] and shows that it exists.

If you want a perfect, polished presentation, see the two references I gave, Rudin and Neveu.

My main contribution here remains:

     f(X) = E[Y|X]
minimizes

     E[ (Y - f(X))^2 ]
and is a minimum variance, unbiased estimator of Y. That's a generalization of the start of this thread and a bit amazing.


> Given real valued, integrable function f

Of course that makes it possible. Note that it doesn’t say that f is simply measurable. By definition being integrable means E[|f|] < infinity. (And being square-integrable means E[f^2] < infinity, an even stronger condition).

Consider Y = Z/X where Z, X iid ~ N(0,1). Try to calculate E[Y|X], and try to minimize E[ (Y - f(X))^2 ]. Neither will make sense. I am sure both Rudin and Neveu were aware of this trivial result.
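For what it's worth, the pathology is visible in simulation: Z/X with Z, X iid N(0,1) is standard Cauchy, so its running sample mean never settles down the way it would for a variable with a finite expectation. A quick sketch (plain Monte Carlo, nothing from the article):

```python
import random

random.seed(0)

def running_means(n):
    """Running sample mean of Y = Z/X with Z, X iid N(0,1) -- standard Cauchy."""
    total = 0.0
    out = []
    for i in range(1, n + 1):
        total += random.gauss(0, 1) / random.gauss(0, 1)
        out.append(total / i)
    return out

means = running_means(100_000)
# Occasional enormous draws keep jerking the running mean around;
# it does not converge to any value as n grows.
print(means[999], means[9_999], means[99_999])
```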


> Of course that makes it possible. Note that it doesn’t say that f is simply measurable.

Irrelevant.

Again, once again, over again, yet again, for real valued random variable Y,

     E[Y^2] 
always exists.

> By definition being integrable means E[|f|] < infinity.

Wrong.

Correcting you is a waste of my time.


You are not correcting me, you are correcting Rudin :)


Alternatively, my take is that while a function takes a thing from the domain and spits out a thing from the codomain, a conditional probability takes a thing from the domain and spits out a distribution over the things of the codomain: P(c|d) with d in the domain D and c in the codomain C. With finite C and D that suffices. If the distributions are one-hot, you get back an ordinary function.
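Concretely, for finite D and C this is just a row-stochastic matrix: row d holds the distribution P(c|d), and if every row is one-hot the matrix is the graph of an ordinary function. A tiny illustration (names are mine):

```python
# Conditional probability P(c|d) as a row-stochastic "matrix":
# rows indexed by the domain D, columns by the codomain C.
P = {
    "d1": {"c1": 0.7, "c2": 0.3},  # d1 maps to a genuine distribution
    "d2": {"c1": 0.0, "c2": 1.0},  # d2's row is one-hot: effectively a function value
}

for d, row in P.items():
    assert abs(sum(row.values()) - 1.0) < 1e-12  # each row sums to 1

def as_function(P):
    """If every row is one-hot, recover the ordinary function D -> C; else None."""
    f = {}
    for d, row in P.items():
        hot = [c for c, p in row.items() if p == 1.0]
        if len(hot) != 1:
            return None  # this row is a nondegenerate distribution
        f[d] = hot[0]
    return f

print(as_function(P))  # None here, since d1's row is not one-hot
```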


For the collectible toy problem, how much money would you save if you paired up with someone else and purchased meals until you both had all 4 toys? My intuition says going from 1 collector to 2 collectors would help significantly (like queuing theory).


Hm. I'm sad nobody answered this. I'm also on my phone and in a bad position to give an exact answer. However, rolling a few scenarios[1] and counting manually, I get 16 meals on average, with a standard error of 5. So not conclusive by any stretch, but it seems like it's roughly as expensive to be two people as it is to be one person.

[1]: https://www.random.org/integers/?num=24&min=1&max=4&col=4&ba...


Oops, miscalculation -- mean 14, standard error around 1. So at this point I can fairly confidently say it helps to pool the effort of two people, though maybe by less than one would expect coming from queueing theory.
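A larger Monte Carlo run backs this up. Here is a sketch under one reading of the pairing (my assumption: the pair pools every toy and stops once they hold two copies of each of the 4 types, i.e. one full set per person):

```python
import random

random.seed(42)

def meals_for_pair(n_toys=4, copies_needed=2):
    """Draws until the pooled collection has `copies_needed` of each toy."""
    counts = [0] * n_toys
    meals = 0
    while min(counts) < copies_needed:
        counts[random.randrange(n_toys)] += 1
        meals += 1
    return meals

trials = 50_000
est = sum(meals_for_pair() for _ in range(trials)) / trials
print(est)  # should land near 14.2, vs 2 * 8.33 ~= 16.7 for two solo collectors
```

By my calculation the exact expectation for this double-collection variant (via the Newman-Shepp integral) is about 14.19, consistent with the hand-rolled estimate of 14 above: pooling saves roughly 2.5 meals over two solo collectors.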


I wrote a blog about the general case https://nickp.svbtle.com/general-collectors-problem


Other commenters made other comments, but I also want to note that this article is terrible for another reason. For the first problem, they worked mathematically all the way up to the recursive expression, then switched to Python to compute the value. That is bad because writing the Python code to calculate ~8.3564 was easy in the first place; the preliminary mathematical analysis wasn't even necessary for it. To be instructional, they could have computed the 8.333 solution mathematically (as e.g. dgacmu or vecter did in this thread) and only later moved to Python for an alternative solution. As it stands, it is not only confusing but also incomplete for both approaches.
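For reference, the exact answer (8 1/3, as other commenters derived) falls out of one line of Python: with i distinct toys in hand, the chance of a new one is (4-i)/4, so the expected wait for it is 4/(4-i), and these waits sum by linearity of expectation.

```python
from fractions import Fraction

# Expected draws to collect all 4 toys: sum of 1/p over the four stages,
# where p = 4/4, 3/4, 2/4, 1/4 is the chance of drawing something new.
expected = sum(Fraction(4, 4 - i) for i in range(4))
print(expected, float(expected))  # 25/3, i.e. ~8.3333
```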


The first problem is known as "Coupon collector's problem".


> published 2023-03-18

straight from the future?


While I wouldn't dream of posting ChatGPT content, here are some questions you could give to ChatGPT to make this all a fair bit clearer:

First, what is an expectation value, as defined in the field of statistics?

Second, is there some reason an expectation value wouldn't behave like a strange attractor from dynamical systems theory and sort of wander about, rather than settle down to a precise value?

Third, what's the connection (if any) between conditional probability and expectation values?

For the first example, a more interesting real-world problem would be one where the total number of different outcomes (toys) is unknown - say, counting the number of species of beetle that exist on planet Earth by steadily sampling all the habitats one can find - how long would the tails be?

Variations on the latter examples might have odds that vary over time, in conjunction with costs of entry that also vary over time - what's the acceptable range of odds & costs to justify playing, with some expectation of at least coming out even? That's a bit more like real-world markets.


I think you were downvoted (not by me) because your questions, except for the one about strange attractors, are about basic topics in probability theory. So you should probably read a book or take a class. The one about strange attractors is more interesting and maybe not studied enough, but you could look at https://en.wikipedia.org/wiki/El_Farol_Bar_problem for a related example.


Why would chatGPT give you answers to these questions that you could trust to be correct and not give you a plausible-sounding but wrong answer and thereby a false sense of confidence?



