Tangentially: I am really enjoying the book "All of Statistics" as a reference for better understanding things like histograms, kernel density functions, etc., and their parameters.
Aren't you referring to "All of Nonparametric Statistics"? https://www.amazon.com/All-Nonparametric-Statistics-Springer...
Either way, I can't recommend these books enough; they really opened my eyes to the inner workings of statistics in a rigorous yet accessible way.
I am actually most interested in non-parametric statistics, especially in "reformulating" as many statistical tests as possible using a small number of robust primitives (like the bootstrap). More pointers in that direction would be very welcome :-)
If you're interested in histograms, I highly recommend "Expressing complex data aggregations with Histogrammar" by Jim Pivarski, where he talks about how for decades histograms have been used in unique ways to do amazing things in high energy physics (HEP, arguably the original Big Data field in compsci):
One point he makes early on is that HEP treats histograms as a kind of accumulator that can stream in data (because the amount of data processed was typically too big to load into RAM all at once), rather than as a chart. From that starting point you can add, divide, and multiply histograms with histograms to build crazy things.
The results are no longer really histograms of course, but it's fun to see how something that we just think of as a chart can be (ab)used like that.
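The accumulator idea can be sketched in a few lines. This is not Histogrammar's actual API, just a minimal illustration of the concept: a histogram that fills one value at a time and supports bin-wise arithmetic (e.g. dividing a "passed" histogram by a "total" histogram to get an efficiency curve):

```python
class Hist:
    """Minimal streaming histogram: fixed bins, fills one value at a time."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi
        self.counts = [0.0] * nbins

    def fill(self, x, weight=1.0):
        # Stream a single value in; values outside [lo, hi) are dropped.
        if self.lo <= x < self.hi:
            i = int((x - self.lo) / (self.hi - self.lo) * self.nbins)
            self.counts[i] += weight

    def __add__(self, other):
        # Merge two histograms, e.g. from parallel workers over disjoint data.
        out = Hist(self.nbins, self.lo, self.hi)
        out.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return out

    def __truediv__(self, other):
        # Bin-wise ratio, e.g. an efficiency: passed / total.
        out = Hist(self.nbins, self.lo, self.hi)
        out.counts = [a / b if b else 0.0
                      for a, b in zip(self.counts, other.counts)]
        return out
```

Because `fill` only touches one value at a time and `+` merges independent histograms, this works over data far too large for RAM, and parallelizes trivially.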
Kernel density plots should be preferred to histograms in nearly all cases. Histograms can be seen as a kernel density plot with a uniform kernel that has been sampled. Since a kernel density plot with a uniform kernel has unbounded frequency content, this sampling introduces aliasing, which is why you get all of these strange effects when adjusting the bin width and offset. In fact, if the distribution of your data happens to be a sine wave, then the histogram will also be a sine wave, but, due to aliasing, it may have a different frequency and phase.
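The bin-offset sensitivity is easy to demonstrate. In this sketch (arbitrary example data, nothing from the original thread), the same dataset is binned twice with the same bin width but edges shifted by half a bin, and the two histograms disagree about the shape:

```python
import numpy as np

rng = np.random.default_rng(0)
# A bimodal-ish mixture; any dataset works for this demonstration.
data = np.concatenate([rng.normal(0.0, 1.0, 500),
                       rng.normal(0.25, 1.0, 500)])

w = 0.5                                  # same bin width for both histograms
edges_a = np.arange(-4.0, 4.0 + w, w)    # one choice of bin edges
edges_b = edges_a + w / 2                # same width, shifted by half a bin

counts_a, _ = np.histogram(data, bins=edges_a)
counts_b, _ = np.histogram(data, bins=edges_b)

# Identical data, identical bin width -- yet the two binnings produce
# visibly different shapes, an artifact of the sampling (aliasing).
print(counts_a)
print(counts_b)
```

Plotting both side by side makes the point vividly: neither picture is "the" histogram of the data.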
For a kernel density plot with a Gaussian kernel, the kernel size does affect the result, but the situation is much better than with histograms for two reasons:
1. The kernel density plot varies smoothly as the kernel size changes, and so there is greater confidence that you have seen the whole story by only looking at a few kernel sizes.
2. You can construct a kernel density plot with a larger kernel given only a kernel density plot with a smaller kernel. Since the convolution of two Gaussians produces a new Gaussian with a variance equal to the sum of the input variances, you only have to convolve the small-kernel plot with another Gaussian to produce the large-kernel plot. This, again, means that you have more confidence that you've seen the whole story by looking at only a few kernel sizes.
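Point 2 can be checked numerically. A sketch under assumed example data: evaluate a Gaussian KDE on a grid with bandwidth h1, convolve it with a Gaussian of variance h2² − h1², and compare against the KDE computed directly with bandwidth h2:

```python
import numpy as np

def kde(data, grid, h):
    # Gaussian KDE evaluated on a grid of points.
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 200)         # arbitrary example data
grid = np.linspace(-5, 5, 2001)
dx = grid[1] - grid[0]

h1, h2 = 0.2, 0.5
extra = np.sqrt(h2**2 - h1**2)           # std of the "bridging" Gaussian

# Convolve the small-bandwidth KDE with the bridging Gaussian...
kernel = np.exp(-0.5 * (grid / extra)**2) / (extra * np.sqrt(2 * np.pi))
smoothed = np.convolve(kde(data, grid, h1), kernel, mode="same") * dx

# ...and it matches the large-bandwidth KDE up to discretization error.
direct = kde(data, grid, h2)
```

So a single fine-bandwidth KDE effectively contains all the coarser ones; you never need to go back to the raw data to smooth further.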
As a side note, there is technically a 1:1 relationship between 1D datasets and kernel density plots with a Gaussian kernel, and so in theory you don't lose any information by constructing the kernel density plot. In practice, however, you do lose information due to limited precision.
When you think you want to plot a histogram, it's often a better idea to plot an (empirical) cumulative distribution [0] instead. You don't have to worry about how to select your bin limits, and you can usually put several in the same plot for comparison without making it unreadable due to overlap.
I like using cumulative distributions because they make small changes in the data a little more obvious: e.g., if all the buckets are 10 but there's a section where they're 11, that difference shows up in a cumulative distribution as a bend in an otherwise straight line, which in my opinion is a much easier difference to see.
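An ECDF is also trivial to compute, with no bin parameters at all. A minimal sketch with made-up example data, where a slight local excess shows up as a bend in an otherwise straight line:

```python
import numpy as np

def ecdf(data):
    # Empirical CDF: x sorted ascending, y = fraction of points <= x.
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

rng = np.random.default_rng(2)
# Uniform data with a slight excess between 0.4 and 0.5:
data = np.concatenate([rng.uniform(0, 1, 1000),
                       rng.uniform(0.4, 0.5, 50)])
x, y = ecdf(data)

# Plotted (e.g. plt.step(x, y)), the uniform part is a straight line;
# the excess appears as a localized steeper segment around 0.4-0.5.
```

Since every dataset maps to exactly one ECDF, several of them overlay cleanly on one axis, which is where they beat histograms for comparisons.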
In my introductory programming class, we teach a few basic forms of chart visualization. By far, students struggle the most with histograms. Even more frustrating, they love line plots and attempt to use them everywhere, despite my explanations that you can almost always use histograms and almost never use line plots! Yet they go with what they find more intuitive...
Depends what you're doing, I'd say. In physics-based modeling, which is used more in e.g. engineering, line plots are often very useful. When examining noisy real-world data, not so much.
> We notice that you're not using the Google Chrome browser. You're welcome to try continuing—but if some parts of the essay are rendering or behaving strangely, please try Chrome instead.
20 years ago we had the "this site works best in Internet Explorer" buttons on way too many sites. Plus ça change...
That said, the site works just fine in Firefox, Edge and even IE11 too. So, if anything, the message is a sign of sloppiness in not even bothering to check.
https://www.amazon.com/All-Statistics-Statistical-Inference-...