Tangentially: I am really enjoying the book "All of Statistics" as a reference for better understanding things like histograms, kernel density functions, etc., and their parameters.
Aren't you referring to "All of Nonparametric Statistics"? https://www.amazon.com/All-Nonparametric-Statistics-Springer...
Either way, I can't recommend these books enough; they really opened my eyes to the inner workings of statistics in a rigorous yet accessible way.
I am actually most interested in non-parametric statistics, especially in "reformulating" as many statistical tests as possible using a small number of robust primitives (like the bootstrap). More pointers in that direction would be very welcome :-)
If you're interested in histograms, I highly recommend "Expressing complex data aggregations with Histogrammar" by Jim Pivarski, where he talks about how for decades histograms have been used in unique ways to do amazing things in high energy physics (HEP, arguably the original Big Data field in compsci):
One point he makes early on is that HEP treats histograms as a kind of accumulator that can stream in data (because the amount of data processed was typically too big to load into RAM all at once), rather than as a chart. From that starting point you can add, divide, and multiply histograms with histograms to build crazy things.
The results are no longer really histograms of course, but it's fun to see how something that we just think of as a chart can be (ab)used like that.
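The accumulator idea can be sketched in a few lines. This is not Histogrammar's actual API, just a minimal illustration of the concept: a histogram that fills one value at a time and supports bin-wise arithmetic (e.g. dividing a "passed" histogram by a "total" histogram to get an efficiency curve):

```python
class Hist:
    """Minimal streaming histogram: fixed bins, fills one value at a time."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi
        self.counts = [0.0] * nbins

    def fill(self, x, weight=1.0):
        # Stream a single value in; values outside [lo, hi) are dropped.
        if self.lo <= x < self.hi:
            i = int((x - self.lo) / (self.hi - self.lo) * self.nbins)
            self.counts[i] += weight

    def __add__(self, other):
        # Merge two histograms, e.g. from parallel workers over disjoint data.
        out = Hist(self.nbins, self.lo, self.hi)
        out.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return out

    def __truediv__(self, other):
        # Bin-wise ratio, e.g. an efficiency: passed / total.
        out = Hist(self.nbins, self.lo, self.hi)
        out.counts = [a / b if b else 0.0
                      for a, b in zip(self.counts, other.counts)]
        return out
```

Because `fill` only touches one value at a time and `+` merges independent histograms, this works over data far too large for RAM, and parallelizes trivially.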
Kernel density plots should be preferred to histograms in nearly all cases. Histograms can be seen as a kernel density plot with a uniform kernel that has been sampled. Since a kernel density plot with a uniform kernel has unbounded frequency content, this sampling introduces aliasing, which is why you get all of these strange effects when adjusting the bin width and offset. In fact, if the distribution of your data happens to be a sine wave, then the histogram will also be a sine wave, but, due to aliasing, it may have a different frequency and phase.
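The bin-offset sensitivity is easy to demonstrate. In this sketch (arbitrary example data, nothing from the original thread), the same dataset is binned twice with the same bin width but edges shifted by half a bin, and the two histograms disagree about the shape:

```python
import numpy as np

rng = np.random.default_rng(0)
# A bimodal-ish mixture; any dataset works for this demonstration.
data = np.concatenate([rng.normal(0.0, 1.0, 500),
                       rng.normal(0.25, 1.0, 500)])

w = 0.5                                  # same bin width for both histograms
edges_a = np.arange(-4.0, 4.0 + w, w)    # one choice of bin edges
edges_b = edges_a + w / 2                # same width, shifted by half a bin

counts_a, _ = np.histogram(data, bins=edges_a)
counts_b, _ = np.histogram(data, bins=edges_b)

# Identical data, identical bin width -- yet the two binnings produce
# visibly different shapes, an artifact of the sampling (aliasing).
print(counts_a)
print(counts_b)
```

Plotting both side by side makes the point vividly: neither picture is "the" histogram of the data.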
For a kernel density plot with a Gaussian kernel, the kernel size does affect the result, but the situation is much better than with histograms for two reasons:
1. The kernel density plot varies smoothly as the kernel size changes, and so there is greater confidence that you have seen the whole story by only looking at a few kernel sizes.
2. You can construct a kernel density plot with a larger kernel given only a kernel density plot with a smaller kernel. Since the convolution of two Gaussians produces a new Gaussian with a variance equal to the sum of the input variances, you only have to convolve the small-kernel plot with another Gaussian to produce the large-kernel plot. This, again, means that you have more confidence that you've seen the whole story by looking at only a few kernel sizes.
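Point 2 can be checked numerically. A sketch under assumed example data: evaluate a Gaussian KDE on a grid with bandwidth h1, convolve it with a Gaussian of variance h2² − h1², and compare against the KDE computed directly with bandwidth h2:

```python
import numpy as np

def kde(data, grid, h):
    # Gaussian KDE evaluated on a grid of points.
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 200)         # arbitrary example data
grid = np.linspace(-5, 5, 2001)
dx = grid[1] - grid[0]

h1, h2 = 0.2, 0.5
extra = np.sqrt(h2**2 - h1**2)           # std of the "bridging" Gaussian

# Convolve the small-bandwidth KDE with the bridging Gaussian...
kernel = np.exp(-0.5 * (grid / extra)**2) / (extra * np.sqrt(2 * np.pi))
smoothed = np.convolve(kde(data, grid, h1), kernel, mode="same") * dx

# ...and it matches the large-bandwidth KDE up to discretization error.
direct = kde(data, grid, h2)
```

So a single fine-bandwidth KDE effectively contains all the coarser ones; you never need to go back to the raw data to smooth further.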
As a side note, there is technically a 1:1 relationship between 1D datasets and kernel density plots with a Gaussian kernel, and so in theory you don't lose any information by constructing the kernel density plot. In practice, however, you do lose information due to limited precision.
When you think you want to plot a histogram, it's often a better idea to plot an (empirical) cumulative distribution [0] instead. You don't have to worry about how to select your bin limits, and you can usually put several in the same plot for comparison without making it unreadable due to overlap.
I like using cumulative distributions because they make small changes in the data a little more obvious: e.g., if all the buckets are 10 but there's a section where they're 11, that difference shows up in a cumulative distribution as a bend in an otherwise straight line, which in my opinion is a much easier difference to see.
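An ECDF is also trivial to compute, with no bin parameters at all. A minimal sketch with made-up example data, where a slight local excess shows up as a bend in an otherwise straight line:

```python
import numpy as np

def ecdf(data):
    # Empirical CDF: x sorted ascending, y = fraction of points <= x.
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

rng = np.random.default_rng(2)
# Uniform data with a slight excess between 0.4 and 0.5:
data = np.concatenate([rng.uniform(0, 1, 1000),
                       rng.uniform(0.4, 0.5, 50)])
x, y = ecdf(data)

# Plotted (e.g. plt.step(x, y)), the uniform part is a straight line;
# the excess appears as a localized steeper segment around 0.4-0.5.
```

Since every dataset maps to exactly one ECDF, several of them overlay cleanly on one axis, which is where they beat histograms for comparisons.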
In my introductory programming class, we teach a few basic forms of chart visualization. By far, students struggle the most with histograms. Even more frustrating, they love line plots and attempt to use them everywhere, despite my explanations that you can almost always use histograms and almost never use line plots! Yet they go with what they find more intuitive...
Depends what you're doing, I'd say. In physics-based modeling, which is used more in e.g. engineering, line plots are often very useful. When examining noisy real-world data, not so much.
> We notice that you're not using the Google Chrome browser. You're welcome to try continuing—but if some parts of the essay are rendering or behaving strangely, please try Chrome instead.
20 years ago we had the "this site works best in Internet Explorer" buttons on way too many sites. Plus ça change...
That said, the site works just fine in Firefox, Edge and even IE11 too. So, if anything, the message is a sign of sloppiness in not even bothering to check.
https://www.amazon.com/All-Statistics-Statistical-Inference-...