Hacker News

Would it help you get past that point if it had said "of a stable process"?



Well... what do you mean by "a stable process" in this context?

Let's try repeating the same example, but now drawing samples from a fixed distribution (in this case, a log-normal distribution):

> data <- exp(rnorm(100))

> sum(as.numeric(data > mean(data)))/length(data)

[1] 0.32

So, again, quite far from a 50/50 split, even though I am assuming a stable/fixed data generation process.

In general, it would help if statistical subjects were not presented in a careless way (i.e., containing things which are obviously not true). I would suggest at least adding an "assuming a symmetrical distribution" caveat (so that your claim is at least approximately correct for the arithmetic average and bounded-variance distributions).

EDIT: If by "a stable process" you mean "a process following a stable distribution"... then, no, it doesn't help.

Here's an example with samples drawn from a Lévy distribution (which is a stable distribution):

> data <- rmutil::rlevy(100)

> sum(as.numeric(data > mean(data)))/length(data)

[1] 0.07


Draw a control chart with your data. Here is one for your initial example: http://beza1e1.tuxen.de/spc?data=1,2,3,4,5,6,7,8,9,10,100

It is immediately obvious that it is not a stable process.


Ok. So, let's try again...

First, let's draw some data and show that the sample mean is not roughly close to the sample median:

> data <- asinh(exp(rnorm(30)))

> mean(data)

[1] 1.227973

> median(data)

[1] 1.059046

> sum(as.numeric(data > mean(data)))/length(data)

[1] 0.4

Now you can take the dataset and make a control chart with it:

> cat(paste0(data,collapse=","))

0.397291781860484,1.01791591678607,2.29127398581896,0.317548192016798,0.825779972770721,0.978034869623426,1.45689922574378,0.722379000865545,2.68231132467641,0.786713029297768,1.20492120955161,1.7082762081373,0.632259821911453,0.346590855307735,2.48238023470879,0.0989934260605276,1.61233320755675,0.906918026775941,1.73743152912329,1.21715325934946,2.78776306537914,0.296838101056961,2.29303061152949,2.65277514999252,0.88486942904647,0.0860402641329708,1.255123685342,0.526097043743278,1.53307173756969,1.10017654868633

https://beza1e1.tuxen.de/spc?data=0.397291781860484,1.017915...

Looks like the data is "stable" (according to that website) and, yet, you get a 40/60 split, rather than a 50/50 split.


I guess the next step in our discussion is whether 40/60 counts as "roughly half"?


Oh. I didn't understand that "roughly" was supposed to be doing all the work, in this context.

I guess a 30/70 split is also "roughly" a 50/50 split, right?

Example:

> cat(paste0(data,collapse=","))

0.194377163769996,0.0102070939265764,0.309119108211189,0.0120786780598317,1.45982220742052,0.00158028772404075,0.035004295275816,0.329022291919098,-0.00635736453948977,0.0158683345454085,1.19240981895862,-0.0127659220845804,-0.00650696353310367,0.00716707476017206,1.85868411217008,0.374960693228966,0.114533107998102,0.591872380192402,0.469305862127421,0.60161713700353,0.000421158731442352,0.265325485949535,-0.00279113302976559,0.0168217608051942,-0.00654584643918818,0.701343388607726,1.84387017506994,-0.00461644730360566,0.0781831777299275,1.05989990859088

https://beza1e1.tuxen.de/spc?data=0.194377163769996,0.010207...

> sum(as.numeric(data > mean(data)))/length(data)

[1] 0.3

> mean(data)

[1] 0.3834637

> median(data)

[1] 0.09635814

Would you say that the mean and the median are "roughly" the same, in this context? I'm curious...


Now this is getting interesting. I did not think there existed a data set with these properties!

I don't have a computer at hand, but if you bootstrap from that population, in how many cases are the XmR limits violated? If it's more than, say, 15 %, I would not consider that distribution stable in the SPC sense, and thus not really a counter-example.

Edit: I found a computer with R:

> mean(replicate(5000, signal(sample(xx, length(xx), replace=T))))

[1] 0.514

This implies a false positive rate of 98 %, so I'd reconsider using XmR charts with this distribution.


> I don't have a computer at hand, but if you bootstrap from that population, in how many cases are the XmR limits violated? If it's more than, say, 15 %, I would not consider that distribution stable in the SPC sense, and thus not really a counter-example.

Still sounds a bit like goalpost moving to me... now I need to perform bootstraps (and change the order of the samples arbitrarily) to even define whether a distribution is "stable" (i.e., stationary) or not?

Either way, I think my original point still stands: different "averages" have different properties, and the claim that arbitrary "averages" will be good estimates of the population median (without invoking anything regarding distributional symmetry) seems rather unfounded.

Of course, if you start adding terms like "roughly", and then stretch their meaning so that "30 is roughly 70" (even though 30 is less than half of 70), then I guess any "average" will be some sort of "rough" estimate of the median, at least to within a few orders of magnitude, since both, by definition, lie between min(data) and max(data), sure.

I'm still not reading the rest of the article posted, though. I remind you that what was written did not mention "stability" in any way. It simply said:

> Again, you and I know better. A statistic known as “average” is intentionally designed to fall in the middle of the range. Roughly half of your measurements will be above average, and the other half below it.

This, as it is written, is sloppy. And I'd rather not read something sloppy.

Have a nice day, though.


Okay, that makes sense. As the author I intentionally write beginner material with some slop to convey the intuition rather than exact lemmas. This is not what you're looking for and that's fine.

I still will keep your criticism at the back of my head and be more wary about sweeping generalisations going forward. Thanks.

It would be nice if someone thought of all edge cases and wrote a formally correct treatment, though! (The statistician's version rather than the practitioner's version, I suppose.)


I'll just leave a final comment: if you restrict yourself to the arithmetic mean, then you can use Cantelli's inequality to make some claims about the distance between the expectation and the median of a random variable in a way that only depends on the variance/st.dev.

See: https://en.wikipedia.org/wiki/Chebyshev%27s_inequality#Cante...
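As an aside, one clean consequence (for the arithmetic mean only) is that the mean and the median can differ by at most one standard deviation, and the same holds for any finite sample treated as its own empirical distribution. A quick Python sketch of the sample version (illustrative only, not from the thread; the lognormal parameters are arbitrary):

```python
import random
import statistics

# Claim: for any sample, |mean - median| <= population-style (ddof=0) std dev.
# This follows from the median minimizing E|X - c| plus Jensen's inequality.
random.seed(42)
for _ in range(1000):
    data = [random.lognormvariate(0, 2) for _ in range(50)]  # heavy-tailed sample
    mean = statistics.fmean(data)
    med = statistics.median(data)
    sd = statistics.pstdev(data)  # ddof=0 standard deviation of the sample
    assert abs(mean - med) <= sd
print("bound held in all 1000 trials")
```

Note this says nothing about other means (geometric, harmonic, ...), which is the broader point above.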

On the other hand, you do not actually know the (population) expectation or (population) variance: you can only estimate them, given some samples (and, quite often, they can be undefined/unbounded).

Also, as I was trying to demonstrate in my previous comment, most "averages" are poor estimators for the expectation of a random variable (compared to the arithmetic sample mean), the same way that min(data) or max(data) are poor estimators for the expectation of a random variable, so it seems a bit "dangerous" to make such a general broad claim (again, in my humble opinion).


I was not aware you were the author. I apologize if anything in my delivery came across as harsh.

I would just suggest considering whether the "any (sample) average is a rough approximation of the (population) median" is a necessary claim in your exposition (particularly as it is stated).

Given this is supposed to be "beginner material", it would seem important not to say something that can mislead beginners and give them an incorrect intuition about "averages" (in my humble opinion). Note that adding the "but only for 'stable' distributions" caveat doesn't really solve things, since that term is not clearly defined and beginners would certainly not know what it means a priori.

I know this may come across as pedantic or nitpicky, but I would really like you to understand why such a general statement, technically, cannot possibly be true (unless you really extend the meaning of "roughly"). When I read what is written, I see two claims, in fact (marked between curly braces):

> A statistic known as “average” is intentionally {designed to fall in the middle of the range}. {Roughly half of your measurements will be above average, and the other half below it}.

The first claim suggests that any average approximates the "midrange" (i.e., 0.5*(max(data)+min(data)), a point that minimizes the L_inf norm w.r.t. your data points). The second claim suggests that any average approximates the "median" (i.e., a point that minimizes the L_1 norm w.r.t. your data points).

The main problem here, as I see it, is that there are infinitely many different possible means, densely covering the space between min(data) and max(data). Thus, unless you are ok claiming that both min(data) and max(data) are reasonable rough estimates of the median and the midrange, you should avoid such a strong and general claim (in my humble opinion).

Note: you can choose a "generalized mean" that is arbitrarily close to min(data) or arbitrarily close to max(data); for example, see https://en.wikipedia.org/wiki/Generalized_mean
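To make that concrete, here is a small Python sketch (hypothetical numbers) showing the power mean M_p sliding toward max(data) or min(data) as the exponent p grows in either direction:

```python
def power_mean(data, p):
    """Generalized (power) mean M_p for positive data.

    M_p -> max(data) as p -> +inf, and M_p -> min(data) as p -> -inf.
    """
    n = len(data)
    return (sum(x ** p for x in data) / n) ** (1 / p)

data = [0.5, 1.0, 2.0, 8.0]
print(power_mean(data, 1))     # arithmetic mean: 2.875
print(power_mean(data, 100))   # close to max(data) = 8
print(power_mean(data, -100))  # close to min(data) = 0.5
```

So by picking p you can place an "average" essentially anywhere in [min(data), max(data)], which is why "any average roughly equals the median" cannot hold in general.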

Either way... I lied... I did read some of the rest, and some of it was interesting (particularly the part about the magic constant), but the lack of formal correctness in a few claims did put me off from reading through all of it.

Once again, have a nice day, and please don't be discouraged by the harshness of my comments.


I really do appreciate the criticism. You're factually correct, of course!

I also see now that the statement about means comes off as more definitive than I meant it to be. When I find the time, I will try to soften the wording and make it clear that it's not strictly true.


I think GP's point is that there are software processes that are fundamentally stable but still generate values like that. I'm in the process of writing an article on that topic, because it annoys me that I don't have a good answer.


In this context "stable" means the thing it means in statistical process control, i.e. the operational definition of no measurements outside of 2.66 times the mean consecutive difference between observations.

It is a problem -- particularly for software -- that SPC tools do not work with subexponential distributions, but it's separate from the observation that when SPC determines that a process is stable, roughly half of measurements will lie above the average.
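For readers unfamiliar with XmR charts, that operational definition can be sketched in a few lines of Python (a minimal illustration; real SPC software layers additional run rules on top of this):

```python
import statistics

def xmr_limits(data):
    """Natural process limits for an XmR chart: mean +/- 2.66 * mean moving range."""
    mr_bar = statistics.fmean(abs(b - a) for a, b in zip(data, data[1:]))
    center = statistics.fmean(data)
    return center - 2.66 * mr_bar, center + 2.66 * mr_bar

def is_stable(data):
    """Operational 'stable' check: no measurement outside the natural process limits."""
    lo, hi = xmr_limits(data)
    return all(lo <= x <= hi for x in data)

# The thread's first example, with its obvious outlier:
print(is_stable([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]))  # False: 100 is outside the limits
```

The moving-range-based limits are what let the chart flag the 100 without assuming any particular distribution up front.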


To be fair to OP, Wheeler never claims that for stable/in-control/predictable processes roughly half of the measurements will lie above the average. The only claim he makes is that 97% of all data points for a stable process (assuming the process draws from a J-curve or single-mound distribution) will fall between the limit lines.

He can't make this claim (about ~half falling above/below the average line), because one of the core arguments he makes is that XmR charts are usable even when you're not dealing with normal distributions. He argues that the intuition behind how they work is that they detect the presence of more than one probability distribution in the variation of a time series.

Some links below:

Arguments for non-normality:

https://spcpress.com/pdf/DJW220.pdf

https://www.spcpress.com/pdf/DJW354.Sep.19.The%20Normality-M...

Claim of homogeneity detection:

https://www.spcpress.com/pdf/DJW204.pdf


I don't have the stats-fu to back it up but I would be very surprised if someone could point to a process where XmR charts are useful, but where the mean is not within 10–20 percentiles of the median.


> the operational definition of no measurements outside of 2.66 times the mean consecutive difference between observations

Not even a simple Gaussian distribution can hold up to this standard of "stability" (unless I understood incorrectly what you mean here):

> data <- rnorm(1000) # i.i.d. normal data

> mcd <- 2.66*mean(abs(diff(data))) # mean consecutive difference * 2.66

> sum(as.numeric(abs(data) > mcd))/length(data) # fraction of bad points

[1] 0.002

Unless you are willing to add additional conditions (e.g., symmetry), I still don't see how criteria that pertain to variance and kurtosis (e.g., "the operational definition of no measurements outside of 2.66 times the mean consecutive difference between observations") can imply any strong relationship between the (sample) arithmetic mean (or any other mean) and the (population) median.

In fact, even distributions for which the "arithmetic mean is approximately equal to the median" claim is roughly correct will almost certainly not display the same property when you use some other mean (e.g., geometric or harmonic mean).
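That last point is easy to check numerically. Here is a Python sketch (a hypothetical example, not from the thread) using a symmetric uniform distribution, for which the arithmetic mean tracks the median but the geometric and harmonic means do not:

```python
import random
import statistics

random.seed(1)
data = [random.uniform(0.1, 10) for _ in range(100_000)]  # symmetric around 5.05

med = statistics.median(data)             # ~5.05
am = statistics.fmean(data)               # ~5.05, close to the median
gm = statistics.geometric_mean(data)      # ~3.86, well below the median
hm = statistics.harmonic_mean(data)       # ~2.15, even further below
print(am, gm, hm, med)
```

So even in the friendliest (symmetric) case, "roughly half the points lie above the average" only survives if "average" specifically means the arithmetic mean.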

Either way, if you have some reference that supports the stated claim, I will be very happy to take a look at it (and educate myself in the process).



