I confess that I found this article unhelpful. There are interesting tidbits in there, but I don't think it helped me identify any specific errors you'd reach by using a standard deviation rather than mean average deviation. The closest it came was:
"1) MAD is more accurate in sample measurements, and less volatile than STD since it is a natural weight whereas standard deviation uses the observation itself as its own weight, imparting large weights to large observations, thus overweighing tail events."
More accurate how? Less volatile, not overweighing tail events: what inference would I make incorrectly by using the standard deviation?
To be clear, I'm not arguing "for" standard deviation, I'm just saying that I wish this article had said more about why it's potentially misleading/less powerful.
I agree. For both the MAD and STD, we are trying to reduce information about the "spread" of a distribution to a single number. Any such reduction must lose information, so you should pick whichever quantity is suitable for your needs.
E.g., in the article they mention that the Pareto distribution has finite MAD but infinite variance. This is meant to be an argument against using the STD, but actually the infinite variance tells us something really important that the MAD does not: the classical central limit theorem /does not apply/ to the Pareto distribution, and the sample variance never settles down no matter how much data you collect!
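A quick way to see both points (rough sketch; numpy's pareto sampler is actually Lomax, so I shift by 1 to get a classical Pareto, and alpha = 1.5 is picked only because it has a finite mean, hence a finite MAD, but infinite variance):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 1.5  # finite mean (so finite MAD), infinite variance

    for n in (10_000, 100_000, 1_000_000):
        x = rng.pareto(alpha, size=n) + 1.0  # classical Pareto on [1, inf)
        mad = np.abs(x - x.mean()).mean()
        print(f"n={n:>9,}  sample MAD={mad:7.3f}  sample STD={x.std():10.3f}")

The sample MAD settles down as n grows; the sample STD never does, because there is no finite variance for it to converge to.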
I think the real message should be to avoid blindly applying techniques and tools (especially formal ones) without thinking about why or what they capture.
Taleb is exasperating. Pareto-Levy distributions are statistical nihilism the way Taleb talks about them.
Data is very often approximately normal. Or can be approximated with something like Student-T. That includes estimators for volatility in stock returns. If you assume your risk profile can be characterized with standard deviation, well, you're an asshole. It also can't be characterized with MAD.
Then you have stuff like this: "MAD is more accurate in sample measurements" - what does this even mean?
Thank you for saying this! I'm just a struggling armchair intellectual, but it seems to me like every half year Taleb comes up with something to loudly hand-wring about, something that nobody else gives a damn about because they're not in the attention whoring business.
No, he's completely legitimate actually. He's just so far ahead of people that they can't tell. There's a good Kahneman quote saying he's one of the top 100 intellectuals.
Bingo bingo bingo. Reducing the spread of a distribution to a single number is correct for very special distributions. Beyond these special distributions you have to do more study, take more measurements, do more simulations to understand what you have underlying your mean or median.
Standard deviation is not the best terminology to use because it sounds like it's referring to the mean absolute deviation (MAD) rather than the square root of the average of the squared deviations.
And when humans think of mean deviation, it's more intuitive to think of deviation in regular units relative to the mean rather than as the root of averaged squared deviations. The former more accurately reflects human intuition.
This is what Taleb is saying. MAD is more intuitive to humans, and we can see this in particular because experienced statisticians, when asked to describe what standard deviation "means", actually describe MAD.
I don't understand. The usual explanation I hear (and the one I think of) when explaining what an STD of x is falls somewhere along the lines of "most (about 2/3rds) of the data will be within +/- x of the average". Is this wrong?
If not, can you give me an example of the typical description people give for STD that actually describes MAD?
Yes, that is wrong. It sounds like you might be thinking about the standard deviation of normally distributed data. In this case, you can say something like, "the probability an observation will be within about [mean-2sd, mean+2sd] is 95%".
But that's assuming the distribution is normal. In other cases, this doesn't hold, but there are more general statements, like Chebyshev's inequality.
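For reference, the distribution-free statement is Chebyshev's inequality, which only requires a finite variance:

    P(|X - \mu| \ge k\sigma) \le 1/k^2

so at k = 2 you can only guarantee 75% of the probability mass within two standard deviations of the mean, versus the roughly 95% you get under normality.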
I have no idea when people would describe SD as MAD, but wouldn't be too surprised, since people first coming into statistics often seem to have trouble conceptualizing how a squared difference could be viewed. It would be surprising if a trained statistician mixed the two up, because SD and MAD arise from something they should be familiar with--Lebesgue spaces.
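Concretely, writing d_i = x_i - \bar{x} for the deviations, the two statistics are just differently normalized L_1 and L_2 norms of the deviation vector:

    \mathrm{MAD} = \frac{1}{n} \sum_i |d_i| = \frac{\lVert d \rVert_1}{n},
    \qquad
    \mathrm{SD} = \sqrt{\frac{1}{n} \sum_i d_i^2} = \frac{\lVert d \rVert_2}{\sqrt{n}}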
I've got a very non-statistics math background, but what you say suggests that there would be a nice way to visualize standard deviations two-dimensionally (since they arise from an L_2 norm), and that it's the one-dimensional "bell curve cross-section width" pictures that confuse people.
It's really pointless to argue about the "best" deviation algorithm, at least on the basis of how it responds to outliers. The process of identifying and ignoring/deprecating outliers isn't something that can or should be lumped in with a simple notion of deviation, be it RMS, MAD, SD, or whatever. Any simple algorithm that you come up with to represent one data set may fail badly with another for this reason (and others.)
Outliers need to be removed, or at least understood, before performing any statistical calculations.
I don't think that answer helps. How do you decide that "too much" weight is being assigned to outliers, and what's the process for deciding the right amount of weight? Can you think of any concrete examples?
Obviously "too much" depends on the subject matter.
But the point is that,
1. STD gives a lot more weight to outliers than MAD.
2. People constantly hear STD and think it means MAD, for all the reasons the article mentions.
The argument isn't that STD always gives too much. It is that it gives a lot more than people expect.
The extreme example given in the article is that a statistical process can have infinite STD, but finite MAD. In other cases, say income, the STD might be double the MAD. That's bad if you think STD means MAD.
Anyhow, this could be solved by educating people on what STD actually means, or by just using MAD. The article apparently thinks the latter is more practical, especially since the benefits of STD have decreased over time.
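To put rough numbers on the income example, here's a toy sketch on a right-skewed, income-like sample (the lognormal parameters are made up purely for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    income = rng.lognormal(mean=10.5, sigma=1.0, size=100_000)  # toy "incomes"

    std = income.std()
    mad = np.abs(income - income.mean()).mean()
    print(f"STD = {std:,.0f}   MAD = {mad:,.0f}   STD/MAD = {std / mad:.2f}")

On skewed data like this the STD comes out well above the MAD, so quoting the STD to someone who mentally hears "typical deviation from the mean" misleads them.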
In this case what Taleb is concerned with is decision making. The right amount of weight is what allows human beings to make good decisions. He believes that MAD is much more intuitive to humans and therefore leads to better decisions.
edit: OK I get that you wanted an example of what "too much" weight is. If you're looking for "how much the next datapoint will deviate from the mean, on average", then the MAD will tell you that, not the STDV. Except in some specific fields (maths, physics), people are much more interested in the MAD than the STDV, but all they get to make decisions is the STDV.
In many cases outliers are extremely important. One that comes to mind is high spenders in mobile games.
Trust me, if analysis was as simple as getting rid of outliers, treating everything as Gaussian, and retrieving simple summary statistics, then good data scientists wouldn't be paid $150k+ :)
(2) Generate the absolute deviations of your data from this median which is {4,0,4,0,4,0,4,0,4,0,4,0,4,0,4,0,4,0,998}
(3) Find the median of the absolute deviations which is 4.
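A minimal sketch of that recipe (step (1), taking the median of the raw data, isn't quoted above, so it's assumed here; this is the median-of-absolute-deviations version, not the mean version):

    from statistics import median

    def median_abs_deviation(data):
        m = median(data)                          # (1) median of the data
        deviations = [abs(x - m) for x in data]   # (2) absolute deviations from it
        return median(deviations)                 # (3) median of those deviations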
It's ironic that Taleb prefers a statistic that ignores extreme examples (i.e. black swans), but he never seems to make sense to me. I've found MAD useful in dealing with noisy data.
> It's ironic that Taleb prefers a statistic that ignores extreme examples (i.e. black swans)
No, that is incorrect on two fronts.
1) MAD does not "ignore" extreme examples, it just weights them the same as other examples. Nassim argues that the weighting of extreme examples in STD is excessive and makes STD less intuitive. I really don't know how you could say that MAD "ignores" extreme examples - they obviously do influence MAD.
2) The act of computing MAD or STD on a sample of observations has no relevance to Black Swan theory. In Black Swan, Nassim defines a black swan event as an unexpected event of large magnitude or consequence. Hence, by definition, an event that has already been observed cannot be a black swan event.
To put it another way, Nassim's main point in Black Swan is that using historical observations to estimate forward risk renders one fragile to Black Swan events - you could use any dispersion metric and this is still the case.
I went with Taleb's proposed definitions:
"Do you take every observation: square it, average the total, then take the square root? Or do you remove the sign and calculate the average?"
In my experience MAD refers to either Median Absolute Deviation or the Mean Absolute Deviation. I was using the median version which is a pretty common "robust" statistic. Although I have occasionally seen the mean version it seems to be less common in practice.
Take a look at the Wikipedia article you linked. No version of Average Absolute Deviation is consistent with Taleb's definition. No squaring, no square root. Sounds more like a geometric mean.
This is exactly what is so frustrating about Taleb. His ideas only partly make sense. He often seems to see the problem, but his solutions are poorly thought out. Of course, he thinks his solutions are perfect and everyone else is an idiot.
In what field do you work that the median absolute deviation is used at all, let alone more than the mean absolute deviation?
When he talked about the mean absolute deviation being sqrt(2/pi) sigma, did that not make it abundantly clear what he was discussing?
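For a normal distribution the two differ only by a fixed factor, which is exactly why it has to be the mean (not median) absolute deviation he's talking about:

    E|X - \mu| = \sqrt{2/\pi}\,\sigma \approx 0.798\,\sigma
    \iff
    \sigma = \sqrt{\pi/2}\,\mathrm{MAD} \approx 1.253\,\mathrm{MAD}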
>No squaring, no square root. Sounds more like a geometric mean
Do you even know what the geometric mean is? (It involves taking an nth root, so your statement just sounds stupid.)
Dispersion functions are built off the distance function under the metric you want to use. Standard deviation uses the L2 metric, which implies a Euclidean distance function: you sum pow(u-x, 2) over the observations and then take pow(sum, 1/2) (with a 1/n averaging folded in).
Mean absolute deviation takes the L1 metric, so the exponents are 1 and 1/1.
That becomes summing pow(abs(u-x), 1) and then pow(sum, 1), which, with the same 1/n averaging, is needless to say the same thing as averaging the absolute differences.
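As a generic sketch (hypothetical helper; the 1/n averaging is folded in so that p = 2 recovers the population standard deviation and p = 1 the mean absolute deviation):

    def lp_dispersion(data, p):
        # (mean of |x - u|^p) ** (1/p): an L_p-style dispersion around the mean
        u = sum(data) / len(data)
        return (sum(abs(x - u) ** p for x in data) / len(data)) ** (1 / p)

    # lp_dispersion(xs, 2) -> population standard deviation of xs
    # lp_dispersion(xs, 1) -> mean absolute deviation of xs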
It depends on whether the numbers provided are the actual data points themselves, or the deviations from the median (the second is what the article provided).
>Except in some specific fields (maths, physics), people are much more interested in the MAD than the STDV, but all they get to make decisions is the STDV.
C'mon guys, it all comes down to whether you like the rhombus or the circle more :) Interesting that MOND (modified Newtonian dynamics), if true, would suggest that a circle at very large distances looks like a square (notice: not like a rhombus :), so physics may start to like it more.
"1) MAD is more accurate in sample measurements, and less volatile than STD since it is a natural weight whereas standard deviation uses the observation itself as its own weight, imparting large weights to large observations, thus overweighing tail events."
More accurate how? Less volatile, not overweighing tail events: what inference would I make incorrectly by using the standard deviation?
To be clear, I'm not arguing "for" standard deviation, I'm just saying that I wish this article had said more about why it's potentially misleading/less powerful.