
Great writeup. I think this should go hand in hand with articles explaining why p <= 0.05 is not the be-all and end-all confirmation of your hypotheses/conclusions. Before I jumped to the software engineering world, I did biostats and was essentially a "bioinformatician". You quickly realize how many experts in the field misuse statistical tools entirely while using their results to prove a point, or worse, draw incorrect conclusions from their results. A big faux pas I saw a lot was using normality-assumed parametric tests on non-normal data where the skew was clearly significant (i.e., skewed enough that you couldn't get away with it, as you sometimes can with mildly non-normal distributions). Seriously, go take a look at some bioinformatics papers (or any biology papers for that matter); it's getting pretty bad.

However, when we learn about these mathematical tools (even in master's or Ph.D. programs), we are often not taught what the math means. I'm sure I'm not the only one to have been taught formulas as a means to prove your research rather than the theory/intuition behind them. Luckily there are articles such as this one that force you to step back and consult the maths again to learn what is really going on.

As an aside, I'm sure this sort of thing happens in all walks of life, not just maths/data science. Some programmers don't understand the intuition behind certain things they code, and when it's time to explain they will likely freeze, because they know how to write the code but don't really know why it works at a fundamental level.




Really?

> Do you take every observation: square it, average the total, then take the square root? Or do you remove the sign and calculate the average?

Neither of these is even an attempt to measure average daily temperature variation. (Assuming any reasonable definition...)

If you're talking about variation from day to day, then looking at differences in max and min, or at given points in time, or on the same day in different years, would be some approaches. Averaging absolute values makes no sense whatsoever -- if the observations were 20 and -20, the result would be the same as if they were 20 and 20. And calculating the standard deviation of the observations is calculating something else again (it might be standard deviation, or it might just be something dumb). Neither of these is a problem with standard deviation itself.

It's sad if newspapers or their readers don't know what standard deviation means, but they're pretty much innumerate across the board so it's not clear whether further muddying terminology is going to help anyone.


They used the word "change" in the paragraph before the one you quoted. I think by "observation" in this paragraph, they mean "x_i - M(x)" as in https://en.wikipedia.org/wiki/Average_absolute_deviation: the difference between the current value and some measure of central tendency (mean, median, mode). It's not explained well, but if you make this assumption, the article makes sense.
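To make that concrete, here's a minimal Python sketch (the observations are made up) of the two quantities the article contrasts, both taken as deviations from the mean:

    import numpy as np

    temps = np.array([18.0, 21.5, 19.0, 24.0, 17.5, 22.0])  # hypothetical observations
    dev = temps - temps.mean()                               # x_i - M(x)

    mad = np.abs(dev).mean()         # mean absolute deviation: drop the sign, average
    sd = np.sqrt((dev ** 2).mean())  # standard deviation: square, average, square root

    print(f"MAD = {mad:.2f}, SD = {sd:.2f}")  # SD >= MAD always (Cauchy-Schwarz)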


The averaging of absolute values in the article's context is happening around a mean of 0 (averaging the variations). In general, one would average the absolute values of the differences from the mean.

Yes, it would be silly to convert a -20 to 20 if the mean isn't 0.

The advantage with the STD is that it is easier to compute than the MAD in a situation like a random walk. For instance, for a one-dimensional random walk with equal-probability +1/-1 steps X_i, the mean is 0 and, with X = sum of the X_i, E(X²) = n is straightforward, whereas computing E(|X|) is not so easy.
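A quick simulation (parameters arbitrary) makes the contrast visible; the closed form for E(|X|) tends to sqrt(2n/pi) for large n, which is far less obvious than E(X²) = n:

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 1000, 100_000

    # X = sum of n fair +/-1 steps; equivalently 2*Binomial(n, 1/2) - n
    X = 2 * rng.binomial(n, 0.5, size=trials) - n

    print("E[X^2] ~", (X ** 2).mean())   # theory: exactly n = 1000
    print("E[|X|] ~", np.abs(X).mean())  # theory: ~ sqrt(2n/pi) ~ 25.2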


That's as clear as mud. What does a -20 vs. a 20 mean? That it was colder at noon than at midnight?

Assuming that there was some sensible definition (e.g. deviation from the past mean temperature, which is not at all what was stated) then "average deviation" has two obvious interpretations. Re-labeling "standard deviation" or "mean deviation" would not be helpful in this case, and the -20 and 20 values STILL don't help the case.


The article isn't clear, but the deviation is either from the sample mean or from the mean of the underlying random variable. You are trying to find the expected difference from the average, hence the absolute value (otherwise the positive and negative deviations cancel out).


Average max, min, or average average?


> A big faux pas I saw a lot was using normality-assumed parametric tests on non-normal data where the skew was clearly significant (i.e., skewed enough that you couldn't get away with it, as you sometimes can with mildly non-normal distributions)

Depends on how much data you have. With a couple hundred observations, you can have as much skew as you like and the normal approximation will still be pretty good. I only mention this because there are a lot of misconceptions about how statistics relies on normal data, when really it mostly just relies on the distribution of the mean being normal, which is pretty much a given because of the central limit theorem. There are much worse sins and abuses -- which yes, unfortunately you do see all the time in scientific papers.
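A rough illustration of that, using an exponential distribution as a stand-in for skewed data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # exponential data: skewness 2, nothing like a normal distribution
    raw = rng.exponential(size=100_000)
    # means of samples of 200 observations each
    sample_means = rng.exponential(size=(50_000, 200)).mean(axis=1)

    print("skewness of raw data:    ", stats.skew(raw))           # ~2
    print("skewness of sample means:", stats.skew(sample_means))  # ~0.14 (2/sqrt(200))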


It is a common misconception that real data always follows a normal distribution. It is true that if you sum many /independent/ random quantities, then the result is approximately normal (e.g., the central limit theorem and its generalizations). But real data tends not to be independent, and many real-world quantities follow extremely skewed distributions: e.g., Zipf's law, Korcak's law, Pareto's law.

For a concrete example, if you look at the distribution of the number of friends users have on social networks, you might expect that 95% of people have the mean number of friends +/- a few standard deviations (since this would be the case for a normal distribution). It would be virtually impossible for someone whose friend count is, say, thousands of standard deviations above the mean to exist, yet there are many such users in social networks (celebrities, bot networks, etc.). In reality, the empirical distribution in this case is extremely skewed.
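For instance, a quick sketch with a Zipf distribution standing in for friend counts (the exponent is arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)

    # Zipf-distributed "friend counts" as a crude model of a social graph
    friends = rng.zipf(a=2.1, size=1_000_000)

    z = (friends.max() - friends.mean()) / friends.std()
    print(f"mean={friends.mean():.1f}, sd={friends.std():.1f}, max={friends.max()}")
    print(f"the best-connected 'user' sits {z:.0f} standard deviations above the mean")

Under a normal distribution, a value even 10 standard deviations out would be essentially impossible; here outliers like this appear in every run.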


> It is true that if you sum many /independent/ random quantities, then the result is approximately normal...

Don't forget finite variances! The end result could more generally be Lévy alpha-stable (alpha < 2.0), as it is with many financial instruments... Normality requires finite variance.
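The classic demonstration is the Cauchy distribution (alpha = 1, variance undefined), where averaging more data doesn't concentrate the sample mean at all:

    import numpy as np

    rng = np.random.default_rng(3)

    # the mean of n standard Cauchy variables is itself standard Cauchy, for any n
    for n in (10, 1_000, 100_000):
        means = rng.standard_cauchy(size=(200, n)).mean(axis=1)
        q1, q3 = np.percentile(means, [25, 75])
        print(f"n={n:>6}: IQR of 200 sample means = {q3 - q1:.2f}  (theory: 2.0 for all n)")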


And yet a Pareto distribution still has a mean, and the sampling distribution of that mean is approximately normally distributed.

Of course I'm not claiming that you can just pretend that a Pareto distribution is a normal distribution, but statistical tests are generally concerned with differences in means (group A does on average 25% better than group B) so it's the sampling distribution we're interested in, not the parent distribution.

You make a good point about autocorrelation and dependent data, but that's a very different issue. To riff on your example about social networks, you'd have dependent data if you're trying to see what kind of news articles people like to read, if those preferences turn out to be mostly guided by what friends are reading.


"And yet a Pareto distribution still has a mean, and the sampling distribution of that mean is approximately normally distributed."

This is often wrong. The central limit theorem requires finite variance, and some common Pareto distributions (those with shape parameter alpha <= 2) have infinite variance.
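For example, a quick simulation (shape parameter chosen to have a finite mean but infinite variance) shows the sampling distribution of the mean staying skewed instead of becoming normal:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)

    # classical Pareto with shape a=1.5: mean = a/(a-1) = 3, variance infinite
    # (numpy's pareto() is the shifted Lomax form, hence the +1)
    a, n, reps = 1.5, 2_000, 10_000
    sample_means = (rng.pareto(a, size=(reps, n)) + 1).mean(axis=1)

    # if the CLT applied, this skewness would be near 0; instead it stays large
    print("skewness of the sampling distribution:", stats.skew(sample_means))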


Yeah, that's why I had the parenthetical saying it was clearly data that would affect their stats (think of studies where you can only extract data from 5 rats per group, etc.). It's also why I said normality-assumed parametric tests; there are many other parametric tests that aren't based on normality assumptions. But yes, as you said, you can absolutely use normality-assumed parametric tests on non-normal data within reason, and the conditions you listed are within reason. Of course this is my opinion, and my opinion could just as easily be entirely wrong, as decided by the expert community :D

And as always, your allowances depend on the test you are performing :P Also gotta love statistics for keeping a large list of customary exceptions determined by the community too.


>you can absolutely use normality-assumed parametric tests on non-normal data within reason and your conditions you listed are within reason

I would be careful throwing around this kind of rhetoric.

There are special cases where the parametric assumption can be relaxed; it is not a general rule. This is why the word 'parametric' is used.


A parametric test is making an assumption about the distribution of the random variable, not its mean.

Internally, many of these tests make use of CLT results about sample means, but that doesn't mean they don't still depend on the distributional assumptions.

Skewness is a major consideration and can lead to completely different inferences.

Let's say we wanted to find the median and our distribution was assumed to be normal. With no skewness, the sample mean would be a good approximation; with skewness, the sample mean would be a very bad approximation.

If the test in question is solely about the mean of the random variable, and nothing else about the distribution, then it's possible that the normality assumption only needs to extend as far as the sample mean (a la the t-test). But that's hardly a parametric test anymore, is it?
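To put numbers on the earlier point about skewness and the median, a small sketch with a lognormal standing in for the skewed distribution:

    import numpy as np

    rng = np.random.default_rng(5)

    # lognormal(0, 1): true median = 1.0, true mean = exp(0.5) ~ 1.65
    x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)

    print("sample mean:  ", x.mean())      # ~1.65, a poor estimate of the median
    print("sample median:", np.median(x))  # ~1.0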


It's not clear to me what kinds of tests you are referring to. Ordinary least squares regression, for example, is all about estimating conditional means, and it is very much parametric. Just finding the best estimates for the parameters of a one-dimensional distribution is usually not particularly interesting, it's certainly not what statisticians spend most of their time on, and in any case nobody's suggesting that the population mean is always equal to the population median.


Why do you think we have the classification 'parametric' if the only thing that matters is the distribution of the sample mean? If it's all going to converge to normal as you say, why are there parametric and non-parametric tests?


Regression works like this: E[Y|X] = Xβ. It is parametric because you model the conditional mean as a weighted sum of various predictors, and these beta "weights" are your parameters. This is true of ordinary regression, Poisson regression, binomial (logistic) regression and so on. An example of nonparametric regression would be something like regression splines.
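A bare-bones sketch of that parametric form on synthetic data (all names and numbers made up):

    import numpy as np

    rng = np.random.default_rng(6)

    # synthetic data with E[Y|X] = 2 + 3x plus normal noise
    x = rng.uniform(0, 10, size=500)
    y = 2 + 3 * x + rng.normal(scale=2.0, size=500)

    # design matrix with an intercept column; beta holds the model's parameters
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("estimated beta:", beta)  # ~[2, 3]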

Why are there nonparametric tests? Because for small sample sizes you can't always trust the normal approximation, and as you state this might be due to something like skew. This takes nothing away from the fact that inferential statistics is almost always about comparisons of means. And yes, the t-test is a parametric test, of which the Mann-Whitney or Wilcoxon would be the nonparametric equivalents.
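As a toy sketch of that pairing (illustrative data only; with small skewed samples the two tests can genuinely disagree):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    # two small, skewed samples, the second shifted upward
    a = rng.exponential(scale=1.0, size=15)
    b = rng.exponential(scale=1.0, size=15) + 1.0

    print("t-test (parametric):          p =", stats.ttest_ind(a, b).pvalue)
    print("Mann-Whitney (nonparametric): p =", stats.mannwhitneyu(a, b).pvalue)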



