Draw an control chat with your data. Here is one for your initial example: http:...

fjkdlsjflkds · 2024-03-08T10:57:15 1709895435

Ok. So, let's try again...

First, let's draw some data and show that the sample mean is not roughly close to the sample median:

> data <- asinh(exp(rnorm(30)))

> mean(data)

[1] 1.227973

> median(data)

[1] 1.059046

> sum(as.numeric(data > mean(data)))/length(data)

[1] 0.4

Now you can take the dataset and make a control chart with it:

> cat(paste0(data,collapse=","))

0.397291781860484,1.01791591678607,2.29127398581896,0.317548192016798,0.825779972770721,0.978034869623426,1.45689922574378,0.722379000865545,2.68231132467641,0.786713029297768,1.20492120955161,1.7082762081373,0.632259821911453,0.346590855307735,2.48238023470879,0.0989934260605276,1.61233320755675,0.906918026775941,1.73743152912329,1.21715325934946,2.78776306537914,0.296838101056961,2.29303061152949,2.65277514999252,0.88486942904647,0.0860402641329708,1.255123685342,0.526097043743278,1.53307173756969,1.10017654868633

https://beza1e1.tuxen.de/spc?data=0.397291781860484,1.017915...

Looks like the data is "stable" (according to that website) and, yet, you get a 40/60 split, rather than a 50/50 split.

qznc · 2024-03-08T11:28:54 1709897334

I guess the next step in our discussion is if 40/60 is „roughly half“?

fjkdlsjflkds · 2024-03-08T13:23:08 1709904188

Oh. I didn't understand that "roughly" was supposed to be doing all the work, in this context.

I guess a 30/70 split is also "roughly" a 50/50 split, right?

Example:

> cat(paste0(data,collapse=","))

0.194377163769996,0.0102070939265764,0.309119108211189,0.0120786780598317,1.45982220742052,0.00158028772404075,0.035004295275816,0.329022291919098,-0.00635736453948977,0.0158683345454085,1.19240981895862,-0.0127659220845804,-0.00650696353310367,0.00716707476017206,1.85868411217008,0.374960693228966,0.114533107998102,0.591872380192402,0.469305862127421,0.60161713700353,0.000421158731442352,0.265325485949535,-0.00279113302976559,0.0168217608051942,-0.00654584643918818,0.701343388607726,1.84387017506994,-0.00461644730360566,0.0781831777299275,1.05989990859088

https://beza1e1.tuxen.de/spc?data=0.194377163769996,0.010207...

> sum(as.numeric(data > mean(data)))/length(data)

[1] 0.3

> mean(data)

[1] 0.3834637

> median(data)

[1] 0.09635814

Would you say that the mean and the median are "roughly" the same, in this context? I'm curious...

kqr · 2024-03-08T15:44:46 1709912686

Now this is getting interesting. I did not think there existed a data set with these properties!

I don't have a computer at hand, but if you bootstrap from that population, in how many cases are the XmR limits violated? If it's more than, say, 15 %, I would not consider that distribution stable in the SPC sense, and thus not really a counter-example.

Edit: I found a computer with R:

> mean(replicate(5000, signal(sample(xx, length(xx), replace=T))))

[1] 0.514

This implies a false positive rate of 98 %, so I'd reconsider using XmR charts with this distribution.

fjkdlsjflkds · 2024-03-08T20:50:41 1709931041

> I don't have a computer at hand, but if you bootstrap from that population, in how many cases are the XmR limits violated? If it's more than, say, 15 %, I would not consider that distribution stable in the SPC sense, and thus not really a counter-example.

Still sounds a bit like goalpost moving to me... now I need to perform bootstraps (and change the order of the samples arbitrarily) to even define if a distribution in "stable" (i.e., stationary) or not?

Either way, I think my original point still stands: different "averages" have different properties, and the claim that arbitrary "averages" will be good estimates of the population median (without invoking anything regarding distributional symmetry) seems rather unfounded.

Of course, if you start adding terms like "roughly", and then extend its meaning so that "30 is roughly 70" (even though 30 is less than half of 70), then I guess any "average" (since, by definition, it exists between max(data) and min(data)) will be some sort of some "rough" estimate of the median, at least to some orders of magnitude (since it also, by definition, exists between max(data) and min(data) as well), sure.

I'm still not reading the rest of the article posted, though. I remind you that what was written did not mention "stability" in any way. It simply said:

> Again, you and I know better. A statistic known as “average” is intentionally designed to fall in the middle of the range. Roughly half of your measurements will be above average, and the other half below it.

This, as it is written, is sloppy. And I rather not ready something sloppy.

Have a nice day, though.

kqr · 2024-03-09T06:38:03 1709966283

Okay, that makes sense. As the author I intentionally write beginner material with some slop to convey the intuition rather than exact lemmas. This is not what you're looking for and that's fine.

I still will keep your criticism at the back of my head and be more wary about sweeping generalisations going forward. Thanks.

It would be nice if someone thought of all edge cases and wrote a formally correct treatment, though! (The statistician's version rather than the practitioner's version, I suppose.)

fjkdlsjflkds · 2024-03-09T11:52:28 1709985148

I'll just leave a final comment: if you restrict yourself to the arithmetic mean, then you can use Cantelli's inequality to make some claims about the distance between the expectation and the median of a random variable in a way that only depends on the variance/st.dev.

See: https://en.wikipedia.org/wiki/Chebyshev%27s_inequality#Cante...

On the other hand, you do not actually know the (population) expectation or (population) variance: you can only estimate them, given some samples (and, quite often, they can be undefined/unbounded).

Also, as I was trying to demonstrate in my previous comment, most "averages" are poor estimators for the expectation of a random variable (compared to the arithmetic sample mean), the same way that min(data) or max(data) are poor estimators for the expectation of a random variable, so it seems a bit "dangerous" to make such a general broad claim (again, in my humble opinion).

fjkdlsjflkds · 2024-03-09T09:37:47 1709977067

I was not aware you were the author. I apologize if anything in my delivery came across as harsh.

I would just suggest considering whether the "any (sample) average is a rough approximation of the (population) median" is a necessary claim in your exposition (particularly as it is stated).

Given this is supposed to be "beginner material", it would seem important not to say something that can mislead beginners and give them an incorrect intuition about "averages" (in my humble opinion). Note that adding the "but only for 'stable' distributions" caveat doesn't really solve things, since that term is not clearly defined and begginers would certainly not know what it means a priori.

I know this may came across as pedantic or nitpicky, but I would really like you to understand why such a general statement, technically, cannot possible be true (unless you really extend the meaning of "roughly"). When I read what is written, I see two claims, in fact (marked between curly braces):

> A statistic known as “average” is intentionally {designed to fall in the middle of the range}. {Roughly half of your measurements will be above average, and the other half below it}.

The first claim suggests that any average approximates the "midrange" (i.e., 0.5*(max(data)+min(data)), a point that minimizes the L_inf norm w.r.t. your data points). The second claim suggests that any average approximates the "median" (i.e., a point that minimizes the L_1 norm w.r.t. your data points).

The main problem here, as I see it, is that there is an infinite number of different possible means, densely convering the space between min(data) and max(data). Thus, unless you are ok claiming that both min(data) and max(data) are reasonable rough estimates of the median and the midrange, you should avoid such strong and general claim (in my humble opinion).

Note: you can choose a "generalized mean" that is arbitrarily close to min(data) or arbitrarily close to max(data); for example, see https://en.wikipedia.org/wiki/Generalized_mean

Either way... I lied... I did read some of the rest, and some of it was interesting (particularly the part about the magic constant), but the lack of formal correctness in a few claims did put me off from reading through all of it.

Once again, have a nice day, and please don't be discouraged by the harshness of my comments.

kqr · 2024-03-13T06:18:03 1710310683

I really do appreciate the criticism. You're factually correct, of course!

I also see now that statement about means comes off as more definitive than I meant it to be. When I find the time to I will try to soften the wording and make it clear that it's not strictly true.

kqr · 2024-03-08T10:34:53 1709894093

I think GPs point is that there are software processes that are fundamentally stable but still generate values like that. I'm in the process of writing an article on that topic, because it annoys me I don't have a good answer.