Why you should be wary of relying on a single histogram of a data set (stats.stackexchange.com)
81 points by aton on April 12, 2013 | 20 comments



As mentioned, one should really be using a kernel density plot instead of a histogram, except when the data already come in discrete classes.

In R, one can simply do:

  library("ggplot2")
  library("datasets")
  ggplot(faithful, aes(x=eruptions)) + geom_density() + geom_rug()
which gives a chart like this (http://jean-francois.im/temp/eruptions-kde.png). Contrast with:

  ggplot(faithful, aes(x=eruptions)) + geom_histogram(binwidth=1)
which gives a chart like this (http://jean-francois.im/temp/eruptions-histogram.png).

Edit: Other plots mentioned in this discussion:

  ggplot(faithful, aes(x = eruptions)) + stat_ecdf(geom = "step")
Cumulative distribution, as suggested by leot (http://jean-francois.im/temp/eruptions-ecdf.png)

  qqnorm(faithful$eruptions)
Q-Q plot, as suggested by christopheraden (http://jean-francois.im/temp/eruptions-qq.png)


But then you would have to choose a certain kernel and assume the data conforms to that distribution, which isn't always true.


This is really not true.

A histogram is considered (by statisticians) to be a non-parametric density estimator. Kernel density estimation is also considered a non-parametric density estimator.

The kernel function you use does not depend on the distribution of your data. If you have normal data, you can use an equation to provide the 'optimal' bandwidth in that case, but this is about bandwidth selection and not the kernel itself.
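In base R, for instance, that normal-reference rule is just a bandwidth function (bw.nrd0, Silverman's rule of thumb); the kernel stays a plain Gaussian either way. Rough sketch with the faithful data from above:

  x <- faithful$eruptions
  bw.nrd0(x)                            # normal-reference ("rule of thumb") bandwidth
  plot(density(x, kernel = "gaussian", bw = bw.nrd0(x)))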

You can also, say, fit a spline to a univariate dataset. We can also call this non-parametric in the sense that the number of knot parameters, etc., can grow with the data size. This doesn't use any probabilistic machinery until you actually 'fit' the spline.
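For example (assuming the logspline package, which nobody in the thread actually mentions), a spline-based density estimate is a one-liner:

  library("logspline")                  # not base R; assumed installed
  fit <- logspline(faithful$eruptions)  # spline fit to the log-density
  plot(fit)                             # estimated density curve
  rug(faithful$eruptions)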

My takeaway from the original post is that you should probably be aware of how things work if you use them, or the defaults might bite you. I like histograms, but I don't like bin-size/position optimization algorithms, so I just use lots of bins; I like kernel density estimates with the data points lightly shown; and in either case you're going to fool yourself a couple of times.


Indeed, but that estimate is likely to be less misleading in most cases than a histogram (which is just a uniform kernel that is always aligned with the bin boundaries).
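You can see the alignment issue by shifting the bin origin by half a bin width; same data, same bin width, different picture (sketch with the faithful data from above):

  par(mfrow = c(1, 2))
  hist(faithful$eruptions, breaks = seq(1.5, 5.5, by = 0.5))    # origin on the half-unit
  hist(faithful$eruptions, breaks = seq(1.25, 5.75, by = 0.5))  # same width, shifted origin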


One particular parameter of the kernel, the bandwidth, can result in highly misleading visualizations if its value is chosen arbitrarily. Here is an example: http://en.wikipedia.org/wiki/File:Comparison_of_1D_bandwidth...

The smoothing gives unsavvy readers a false sense of accuracy. With a histogram they can at least tell it's an approximation.
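For example, with the faithful data from upthread, an arbitrarily over- or under-smoothed bandwidth gives two very different pictures (rough sketch):

  par(mfrow = c(1, 2))
  plot(density(faithful$eruptions, bw = 1))     # oversmoothed: the two modes blur together
  plot(density(faithful$eruptions, bw = 0.05))  # undersmoothed: spurious wiggles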


Yup. Luckily there are good methods for choosing the bandwidth:

http://www.umiacs.umd.edu/labs/cvl/pirl/vikas/Software/optim...
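Base R's bw.SJ (Sheather-Jones plug-in) is one such automatic selector, in the same spirit as the linked software (sketch, continuing with the faithful data from upthread):

  bw <- bw.SJ(faithful$eruptions)  # Sheather-Jones plug-in bandwidth
  plot(density(faithful$eruptions, bw = bw))
  rug(faithful$eruptions)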


Big ups on your use of ggplot--best R graphing capabilities around! In response to your update about the QQ plot, I didn't compare against normality like you did (the article is comparing exponentials, so a normal QQ isn't the best choice). The QQ plot just compares the quantiles of one distribution to another (could be an ecdf against a hypothesized cdf, or an ecdf against another ecdf...). Essentially, by plotting one set of points against another, I'm suggesting that the empirical distribution of Annie is the same as the empirical distribution of Brian, or any other pairing.
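In base R that comparison is just qqplot; e.g., assuming the Annie and Brian vectors from the linked post are loaded:

  qqplot(Annie, Brian)   # empirical quantiles of one sample against the other
  abline(0, 1, lty = 2)  # points near this line suggest the two distributions match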


Good point; I wasn't using the data from the link but what you mention (doing a QQ plot of distribution pairs to check if they are similar) is probably what I should've posted instead of a QQ plot of some other dataset against an ideal normal distribution.

And as you mention, ggplot is seriously awesome.


Yes, probability density estimation might be fun, but the simplest thing to do when comparing distributions, if you're worried about binning issues, is to plot their empirical cumulative distribution functions.
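In base R, something like this (assuming the four vectors from the post are loaded):

  plot(ecdf(Annie), col = "red", main = "Empirical CDFs")
  lines(ecdf(Brian), col = "blue")
  lines(ecdf(Chris), col = "green")
  lines(ecdf(Zoe),   col = "cyan")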


Completely agree. If your audience gets them, that's the most robust, easiest to interpret way to visualize a distribution (continuous or discrete). But it may require a few words of explanation depending on the audience.


This is what you should be doing:

  plot(density(Annie), col="red")
  lines(density(Brian), col="blue")
  lines(density(Chris), col="green")
  lines(density(Zoe), col="cyan")
This is the plot you get: http://i.imgur.com/sY2awX7.png



Hmm, reminds me much more of Anscombe's quartet: http://en.wikipedia.org/wiki/Anscombe%27s_quartet


Interesting paradox. I haven't seen that many statisticians using just a histogram when determining whether a certain distribution fits data reasonably. Kernel density estimators are a much better choice (for continuous data, like the data in the post), but they are also affected by your choice of bandwidth. When it comes down to it, like going to the doctor, sometimes the best choice is to get a second (or third!) opinion. For what it's worth, drawing a QQ plot (something I've seen in every statistical consultation I've ever done) reveals the dependence structure of the data immediately and obviously, in the form of a perfect linear relationship between any two variables.


I think it's foolish to assume there is a magic tool that will instantly give you a meaningful probability distribution that can statistically reproduce arbitrary datasets. Once you've chosen a certain bandwidth (by fixed binning, or by choosing a kernel), you've lost the ability to resolve structure finer than that scale, and you cannot quantify features (e.g. the macroscopic view) much larger than it either.

But of course, playing around with these parameters will hopefully give you a nice plot, insight into the problem, and allow you to propose a proper model describing your data. Then you can fit this model to your data and extract the model parameters more precisely.
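With the faithful eruptions from upthread, for instance, the density plot suggests a two-component Gaussian mixture; a package like mixtools (my choice, not something from the thread) can fit that by EM:

  library("mixtools")                            # assumed installed
  fit <- normalmixEM(faithful$eruptions, k = 2)  # two-component Gaussian mixture via EM
  fit$lambda  # mixing proportions
  fit$mu      # component means
  fit$sigma   # component standard deviations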

And when the width of the distribution's topological features matches your kernel size, this PDF will of course look almost identical to the density plots.


Indeed, although Q-Q plots are very unlikely to be understood by people who don't have a good grasp of statistics, whereas a misleading histogram will be (and probably without knowledge of the caveats behind histograms).


A great point, but therein lies my biggest complaint with the simplification of statistics that I see in the startup world--sometimes the technical details are actually important. As an analogy, while mass-production has given us a car that anyone can operate, we are largely helpless when one breaks down. Complications abound when individuals try to leverage an overly-simplistic view of a subject (raise your hand if you've heard "We are 95% sure the true [...] lies in this confidence interval").

To the credit of the shadier individuals in my profession, this histogram subtlety nicely highlights how it can be quite easy to bend the data to your argument using ad-hoc procedures (KDEs, hists, QQs, boxplots). A carefully chosen bin width, smoothing parameter, or covariate can present a different view of the data than some other parameter/covariate. That's why it's nice to have other statisticians capable of reproducing and disseminating the work.


Is this basically just an effect of quantization aliasing?


In other words, rounding error.

There's a great story of a histogram of heights in Napoleon's army having two peaks, eliciting all sorts of theories. In reality, height in the army was normally distributed, but the data was collected in centimeters and people were looking at a histogram binned by inches. The middle bin covered only two whole-centimeter values while the bins on either side covered three each, so those neighboring bins had dramatically more counts.
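You can reproduce the artifact with a toy simulation (my own illustration, not the original data): record normally distributed heights to the nearest centimeter, then histogram them in one-inch bins; each bin covers either two or three whole-centimeter values, so the counts jump around.

  set.seed(1)
  cm <- round(rnorm(50000, mean = 165, sd = 7))  # heights recorded to the nearest cm
  inches <- cm / 2.54
  hist(inches, breaks = seq(floor(min(inches)), ceiling(max(inches)), by = 1))  # one-inch bins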


Would love to find the citation for that one...



