Well... what do you mean by "a stable process" in this context?
Let's try repeating the same example, but now drawing samples from a fixed distribution (in this case, a log-normal distribution):
> data <- exp(rnorm(100))
> sum(as.numeric(data > mean(data)))/length(data)
[1] 0.32
So, again, quite far from a 50/50 split, even though I am assuming a stable/fixed data generation process.
In general, it would help if statistical subjects were not presented in a careless way (i.e., containing things which are obviously not true). I would suggest at least adding an "assuming a symmetrical distribution" caveat (so that your claim is at least approximately correct for the arithmetic average and for bounded-variance distributions).
EDIT: If by "a stable process" you mean "a process following a stable distribution"... then, no, it doesn't help.
Here's an example with samples drawn from a Lévy distribution (which is a stable distribution):
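One standard way to draw such samples (a sketch of the kind of example meant here; it relies on the fact that if Z is standard normal, then 1/Z² follows a standard Lévy distribution):

```r
# A standard Levy sample: if Z ~ N(0,1), then 1/Z^2 ~ Levy(0,1)
data <- 1/rnorm(1000)^2
sum(as.numeric(data > mean(data)))/length(data)
# Typically only a few percent of points exceed the sample mean:
# the heavy right tail drags the mean far above the median
# (the population mean of a Levy distribution is infinite).
```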
Now this is getting interesting. I did not think there existed a data set with these properties!
I don't have a computer at hand, but if you bootstrap from that population, in how many cases are the XmR limits violated? If it's more than, say, 15 %, I would not consider that distribution stable in the SPC sense, and thus not really a counter-example.
> I don't have a computer at hand, but if you bootstrap from that population, in how many cases are the XmR limits violated? If it's more than, say, 15 %, I would not consider that distribution stable in the SPC sense, and thus not really a counter-example.
Still sounds a bit like goalpost moving to me... now I need to perform bootstraps (and change the order of the samples arbitrarily) to even define whether a distribution is "stable" (i.e., stationary) or not?
Either way, I think my original point still stands: different "averages" have different properties, and the claim that arbitrary "averages" will be good estimates of the population median (without invoking anything regarding distributional symmetry) seems rather unfounded.
Of course, if you start adding terms like "roughly", and then stretch their meaning so that "30 is roughly 70" (even though 30 is less than half of 70), then I guess any "average" will be some sort of "rough" estimate of the median, at least to within some orders of magnitude, since both, by definition, lie between min(data) and max(data). Sure.
I'm still not reading the rest of the article posted, though. I remind you that what was written did not mention "stability" in any way. It simply said:
> Again, you and I know better. A statistic known as “average” is intentionally designed to fall in the middle of the range. Roughly half of your measurements will be above average, and the other half below it.
This, as it is written, is sloppy. And I'd rather not read something sloppy.
Okay, that makes sense. As the author, I intentionally write beginner material with some slop to convey the intuition rather than exact lemmas. This is not what you're looking for, and that's fine.
I will still keep your criticism in the back of my mind and be more wary of sweeping generalisations going forward. Thanks.
It would be nice if someone thought of all edge cases and wrote a formally correct treatment, though! (The statistician's version rather than the practitioner's version, I suppose.)
I'll just leave a final comment: if you restrict yourself to the arithmetic mean, then you can use Cantelli's inequality to make some claims about the distance between the expectation and the median of a random variable in a way that only depends on the variance/st.dev.
On the other hand, you do not actually know the (population) expectation or (population) variance: you can only estimate them, given some samples (and, quite often, they can be undefined/unbounded).
Also, as I was trying to demonstrate in my previous comment, most "averages" are poor estimators of the expectation of a random variable (compared to the arithmetic sample mean), just as min(data) or max(data) are, so it seems a bit "dangerous" to make such a broad, general claim (again, in my humble opinion).
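To make the Cantelli point concrete: for any finite-variance distribution, the distance between the mean and the median is at most one standard deviation, which is easy to check on a sample (a sketch with skewed log-normal data):

```r
# |mean - median| <= sd holds for any finite-variance distribution
data <- exp(rnorm(10000))           # skewed log-normal sample
abs(mean(data) - median(data))      # gap between the two "centres"
sd(data)                            # always at least as large
```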
I was not aware you were the author. I apologize if anything in my delivery came across as harsh.
I would just suggest considering whether the claim that "any (sample) average is a rough approximation of the (population) median" is actually necessary for your exposition (particularly as it is currently stated).
Given this is supposed to be "beginner material", it would seem important not to say something that can mislead beginners and give them an incorrect intuition about "averages" (in my humble opinion). Note that adding the "but only for 'stable' distributions" caveat doesn't really solve things, since that term is not clearly defined and beginners would certainly not know what it means a priori.
I know this may come across as pedantic or nitpicky, but I would really like you to understand why such a general statement, technically, cannot possibly be true (unless you really stretch the meaning of "roughly"). When I read what is written, I see, in fact, two claims (marked between curly braces):
> A statistic known as “average” is intentionally {designed to fall in the middle of the range}. {Roughly half of your measurements will be above average, and the other half below it}.
The first claim suggests that any average approximates the "midrange" (i.e., 0.5*(max(data)+min(data)), a point that minimizes the L_inf norm w.r.t. your data points). The second claim suggests that any average approximates the "median" (i.e., a point that minimizes the L_1 norm w.r.t. your data points).
The main problem here, as I see it, is that there is an infinite number of different possible means, densely covering the space between min(data) and max(data). Thus, unless you are ok claiming that both min(data) and max(data) are reasonable rough estimates of the median and the midrange, you should avoid such a strong and general claim (in my humble opinion).
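To illustrate (a sketch using the power-mean family, which interpolates continuously between min(data) as p → −∞ and max(data) as p → +∞):

```r
# Power mean of order p: p = 1 is arithmetic, p = -1 harmonic,
# p -> -Inf tends to min(data), p -> +Inf tends to max(data)
power_mean <- function(x, p) mean(x^p)^(1/p)
data <- exp(rnorm(100))  # positive, skewed sample
sapply(c(-20, -1, 1, 2, 20), function(p) power_mean(data, p))
# All five are legitimate "means", yet they are spread across
# almost the whole interval [min(data), max(data)].
```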
Either way... I lied... I did read some of the rest, and some of it was interesting (particularly the part about the magic constant), but the lack of formal correctness in a few claims did put me off from reading through all of it.
Once again, have a nice day, and please don't be discouraged by the harshness of my comments.
I really do appreciate the criticism. You're factually correct, of course!
I also see now that the statement about means comes off as more definitive than I meant it to be. When I find the time, I will try to soften the wording and make it clear that it's not strictly true.
I think the GP's point is that there are software processes that are fundamentally stable but still generate values like that. I'm in the process of writing an article on that topic, because it annoys me that I don't have a good answer.
In this context "stable" means the thing it means in statistical process control, i.e. the operational definition of no measurements outside of 2.66 times the mean consecutive difference between observations.
It is a problem -- particularly for software -- that SPC tools do not work with subexponential distributions, but it's separate from the observation that when SPC determines that a process is stable, roughly half of measurements will lie above the average.
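For concreteness, the stability check being discussed can be sketched as follows (a generic XmR-style computation, not tied to any particular SPC tool):

```r
# XmR-style limits: centre line +/- 2.66 * mean moving range
xmr_violation_rate <- function(data) {
  mr <- mean(abs(diff(data)))          # mean consecutive difference
  outside <- abs(data - mean(data)) > 2.66 * mr
  sum(outside) / length(data)          # fraction outside the limits
}
xmr_violation_rate(rnorm(10000))  # ~0.3% for i.i.d. Gaussian data
```

For i.i.d. normal data, 2.66 times the mean moving range works out to roughly 3 standard deviations, hence the small but nonzero violation rate.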
To be fair to OP, Wheeler never claims that for stable/in-control/predictable processes roughly half of the measurements will lie above the average. The only claim he makes is that 97% of all data points for a stable process (assuming the process draws from a J-curve or single-mound distribution) will fall between the limit lines.
He can't make this claim (about ~half falling above/below the average line), because one of the core arguments he makes is that XmR charts are usable even when you're not dealing with normal distributions. He argues that the intuition behind how they work is that they detect the presence of more than one probability distribution in the variation of a time series.
I don't have the stats-fu to back it up but I would be very surprised if someone could point to a process where XmR charts are useful, but where the mean is not within 10–20 percentiles of the median.
> the operational definition of no measurements outside of 2.66 times the mean consecutive difference between observations
Not even a simple Gaussian distribution can hold up to this standard of "stability" (unless I understood incorrectly what you mean here):
> data <- rnorm(1000) # i.i.d. normal data
> mcd <- 2.66*mean(abs(diff(data))) # mean consecutive difference * 2.66
> sum(as.numeric(abs(data - mean(data)) > mcd))/length(data) # fraction outside the limits
[1] 0.002
Unless you are willing to add additional conditions (e.g., symmetry), I still don't see how criteria that pertain to variance and kurtosis (e.g., "the operational definition of no measurements outside of 2.66 times the mean consecutive difference between observations") can imply any strong relationship between the (sample) arithmetic mean (or any other mean) and the (population) median.
In fact, even distributions for which the "arithmetic mean is approximately equal to the median" claim is roughly correct will almost certainly not display the same property when you use some other mean (e.g., geometric or harmonic mean).
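For example, for a roughly symmetric positive sample the arithmetic mean tracks the median well, while the geometric and harmonic means do not (a quick sketch; the positivity filter is needed before taking logs and reciprocals):

```r
data <- rnorm(10000, mean = 100, sd = 30)
data <- data[data > 0]        # geometric/harmonic means need positive values
median(data)                  # ~100
mean(data)                    # arithmetic mean: close to the median
exp(mean(log(data)))          # geometric mean: noticeably lower
1/mean(1/data)                # harmonic mean: lower still
```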
Either way, if you have some reference that supports the stated claim, I will be very happy to take a look at it (and educate myself in the process).