Quantile Digest (Q-digest) or something similar is what I believe is desired here.
From what I understand, it's a fixed-size data structure that represents quantiles as a tree of histogram bands, pruning the nodes whose densities differ least from their parents' in order to trade error against size. They also have the property that you can merge them and re-compress, so you can turn second-level data into minute-level data, or compress more accurate (large) archival digests into smaller ones to, say, keep streaming a varying number of metrics over a link of varying bandwidth by sacrificing quality.
They're pretty simple because they're designed for sensor networks, but I think you could design similar structures with a dynamic value range instead of a fixed one, and with a variable size (pruning nodes based on an error threshold instead of, or in addition to, a desired size).
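To make that a bit more concrete, here's a rough Python sketch of a simplified q-digest-style structure: fixed integer value range, heap-style node ids, a single bottom-up compression pass, and merge-then-recompress. The class name, the exact fold condition, and the parameters are mine; this is an illustration of the idea, not the algorithm from the paper.

    from collections import defaultdict

    class QDigest:
        def __init__(self, depth=8, k=16):
            self.depth = depth              # values are ints in [0, 2**depth - 1]
            self.k = k                      # compression factor: roughly O(k) nodes kept
            self.n = 0                      # total number of observations
            self.counts = defaultdict(int)  # heap-style node id (root = 1) -> count

        def add(self, value, count=1):
            self.counts[(1 << self.depth) + value] += count  # leaf node for this value
            self.n += count

        def merge(self, other):
            # digests are mergeable: add counts node-by-node, then re-compress
            for node, c in other.counts.items():
                self.counts[node] += c
            self.n += other.n
            self.compress()

        def compress(self):
            threshold = max(1, self.n // self.k)
            # bottom-up, level by level: fold a node and its sibling into the parent
            # when the combined count is small (i.e. it differs little from the parent)
            for level in range(self.depth, 0, -1):
                for node in [m for m in self.counts if (1 << level) <= m < (1 << (level + 1))]:
                    if node not in self.counts:
                        continue                      # already folded as a sibling
                    parent, sibling = node >> 1, node ^ 1
                    combined = (self.counts.get(node, 0) + self.counts.get(sibling, 0)
                                + self.counts.get(parent, 0))
                    if combined <= threshold:
                        self.counts[parent] = combined
                        self.counts.pop(node, None)
                        self.counts.pop(sibling, None)

        def quantile(self, q):
            # accumulate counts left to right over value ranges and return the upper
            # end of the range where the running total crosses q * n
            def upper(node):
                while node < (1 << self.depth):       # descend to rightmost leaf
                    node = 2 * node + 1
                return node - (1 << self.depth)

            target, seen = q * self.n, 0
            for node, c in sorted(self.counts.items(), key=lambda kv: (upper(kv[0]), -kv[0])):
                seen += c
                if seen >= target:
                    return upper(node)
            return (1 << self.depth) - 1

    # e.g. merge two per-second digests into one and query it:
    d1, d2 = QDigest(), QDigest()
    for v in range(256): d1.add(v)
    for v in range(0, 256, 2): d2.add(v)
    d1.merge(d2)
    print(d1.quantile(0.5), d1.quantile(0.99))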
If anyone knows of a time-series system using something like this, I'd love to learn about it.
You know what I've always found really useful? Entire distributions. Forget means, medians, percentiles, etc. — give me the complete distribution (along with sample size) so I can understand all of the nuances of the data.
(Better yet, just give me the raw data so I can analyze it myself. I find it hard to blindly trust someone else's conclusions considering all of the p-hacking going on nowadays.)
Yeah. We're starting to contemplate a new monitoring/performance/etc. system, and I'm thinking I don't want to do any of the long-term storage aggregation like in rrd. We'll store the full data for each item forever. Not sure if we can afford the storage, but I want to try it for a while and see if it works.
We did something similar at Netflix. We had all the aggregations but also stored all the raw data. The raw data would be pushed out of the live system, replaced with aggregates, and then stored as flat text in S3. If for some reason you needed the old data, you just put in a ticket with the monitoring team to load it back into the live system (I think this is even self-service now).
The system would then load the data from S3 back into the live system via Hadoop. Turns out it was pretty cheap to store highly compressible files in S3.
I've built a similar system as well. Raw data is compressed and stored in S3, but aggregations are stored in Postgres. The data stored in Postgres is a compressed binary representation of a histogram, and I added a few C functions in Postgres to do things like select x,y, histogram_percentile(histogram_merge(data), 99) group by x,y, etc.
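For readers wondering what those functions boil down to, here's a simplified Python sketch of the general idea. The log-spaced bucket layout, the plain-list representation (instead of a compressed binary blob), and the exact behavior are assumptions for illustration only; the actual C extension may differ.

    import bisect

    # assumed bucket layout: log-spaced upper bounds, units arbitrary
    BOUNDS = [10 ** (i / 10.0) for i in range(70)]

    def histogram_new():
        return [0] * (len(BOUNDS) + 1)          # last slot is the overflow bucket

    def histogram_add(hist, value):
        hist[bisect.bisect_left(BOUNDS, value)] += 1

    def histogram_merge(hists):
        # merging histograms is just element-wise addition of bucket counts
        merged = histogram_new()
        for h in hists:
            for i, c in enumerate(h):
                merged[i] += c
        return merged

    def histogram_percentile(hist, pct):
        # walk the buckets until the running count passes pct% of the total,
        # then report that bucket's upper bound as the estimate
        total = sum(hist)
        if total == 0:
            return None
        target = total * pct / 100.0
        seen = 0
        for i, c in enumerate(hist):
            seen += c
            if seen >= target:
                return BOUNDS[min(i, len(BOUNDS) - 1)]
        return BOUNDS[-1]

    # usage, mirroring the SQL above:
    # p99 = histogram_percentile(histogram_merge(per_minute_hists), 99)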
My company stores all raw data for pretty much everything forever (and, in finance, there are a lot of things). It's binary, compressed with xz, and stored in several places. The grand total Glacier cost is something like $50/month.
(Do NOT, for the love of your sanity, emulate this design exactly. Use Backblaze or Google's nearline storage. Do not touch Glacier at all if you can avoid it. When I wrote the Glacier integration, I did it because it was far cheaper than its competitors. That's no longer the case.)
But the raw data isn't everything either; it's usually just a sample. The analysis is there to explain what the data means.
If you're saying that people should provide the full data along with their analysis, I fully agree! But if you're saying "skip the analysis", then unfortunately I, and most people, are not statistically educated enough to do all the analysis ourselves.
Always keep in mind that estimating distributions from samples is very hard, particularly for multidimensional and continuous variables. Sometimes even infinitely many samples do not suffice (I am not kidding).
The good news is that there are interesting and relevant tasks (for example, classification) that you can solve optimally without knowing the distribution, even though knowing the distribution would also let you solve them optimally. Or, more succinctly: knowing the distribution is sufficient but not necessary for such tasks (small mercies).
It's also often useful to think of the true distribution of the underlying system/dynamic in question, and to consider a histogram as a sample that therefore has sample error. E.g. if the histogram's tail bins have very few samples then they're giving you some information about the shape of the distribution in that region, but not the exact probability of events in that region.
This is an excellent point. The distribution of the measurements is the distribution of the sample, not of the population. In my experience very few non-statisticians understand the difference. Statistics is hard.
This is true. I almost mentioned the difficulties of estimating the kernel density bandwidth in my original post, but thought that would be getting too far on a tangent.
Your post is a little confused. For actual data there's no distinction between the empirical distribution and the data itself, so when you ask for the "distribution" "along with sample size", that's exactly what you're getting (same with your "raw data").
A nice distribution diagram is very informative, but if you want to track performance over time and have warning triggers in place, you need to be able to capture what you care about in a handful of numeric metrics.
I've recently written a page about "averaging" percentiles correctly by approximating the combined histogram of two distributions. This is demonstrated in a live time diagram with a logarithmic time axis here:
Great article, I ran into this exact problem at work today. You can also get a very similar problem if the timeseries aggregation system you're using does any pre-aggregation before you calculate the percentile - for example, if you sample your servers every 90 seconds, then any latency number it reports is likely already averaged over the requests the server received during that time period, meaning your 99th percentile number is really the latency of the 99th percentile server, not the 99th percentile request. Using latency buckets solves this problem as well, however.
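Here is a rough Python illustration of that trap, with invented numbers: a p99 over per-server averages measures the 99th-percentile server, not the 99th-percentile request.

    import random
    import statistics

    random.seed(0)

    per_request, per_server_avg = [], []
    for _ in range(100):                               # 100 servers, 1000 requests each
        latencies = [random.lognormvariate(3, 0.3) for _ in range(1000)]  # ~20 ms typical
        latencies[:20] = [5000.0] * 20                 # 2% of each server's requests are very slow
        per_request.extend(latencies)
        per_server_avg.append(statistics.mean(latencies))

    def p99(xs):
        return sorted(xs)[int(0.99 * len(xs)) - 1]

    print("p99 over raw requests:       ", round(p99(per_request)))      # ~5000: sees the slow tail
    print("p99 over per-server averages:", round(p99(per_server_avg)))   # ~120: hides it completely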
The article is useful because it outlines many different ways in which to monitor the performance of a system, many of which are better than just looking at the mean and P99. However, the main thesis that "an average of a percentile is meaningless" is just plain wrong. If the distribution is fixed, then averaging different P99 measurements will give you the best possible estimate of what P99 is in the population (as opposed to your sample.) If the distribution is moving (you're making performance improvements or your user base is growing) then a moving average of a percentile will move with it.
You are right when looking at the true P99 of the whole distribution. Then the P99 will always have the same value. The problem arises when estimating a distribution from incomplete data.
E.g. estimating the P99 from batches of 200 data points each will give you a lot of variation. In this case averaging will be wrong. This gets worse if you experience bunching: e.g. a server will have high ping on several packets when the load is high; after that it will have low ping for thousands of packets when the load is low. Here, it just doesn't work.
Instead, you can get a better result by adding the (estimated) histograms and resampling the quantiles from the combined distribution function.
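A minimal Python sketch of that approach, assuming both histograms share the same (made-up) bucket edges:

    # edges and counts below are invented; both histograms must use the same edges
    edges = [0, 10, 20, 50, 100, 200, 500, 1000]        # ms

    def combined_quantile(hist_a, hist_b, q):
        # add the two bucket-count histograms and read the quantile off the combined CDF
        counts = [a + b for a, b in zip(hist_a, hist_b)]
        total = sum(counts)
        target = q * total
        seen = 0
        for i, c in enumerate(counts):
            if c > 0 and seen + c >= target:
                # linear interpolation inside the bucket where the CDF crosses the target
                frac = (target - seen) / c
                return edges[i] + frac * (edges[i + 1] - edges[i])
            seen += c
        return edges[-1]

    server_a = [120, 300, 400, 100, 50, 20, 10]          # one-minute histogram, server A
    server_b = [500, 250, 150, 60, 25, 10, 5]            # one-minute histogram, server B
    print(combined_quantile(server_a, server_b, 0.99))   # ~450 ms from the combined histogram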
Even P99 estimates averaged from many tiny samples are at most inaccurate; they won't be biased.
The problem you talk about is one of granularity: you don't want to know how bad things are in general, you want to know how bad things are when they are actually bad. That's perfectly legitimate, but it's a different metric altogether that just happens to also be based off of the 99th percentile. It depends on your definition of "the distribution".
This probably sounds like nitpicking, but people should know what they're talking about when they call out others for being wrong.
This looks biased to me even though the distribution is uniform. I did not expect that. Maybe the quantile function is broken.
EDIT: I've installed R and run your code. You use a very low number of samples in the mean, so I changed it a bit: (1) I use the uniform distribution, so we know the true value of the 99th percentile is 0.99; (2) I've increased the number of samples:
x <- runif(100000)  # uniform sample, so the true 99th percentile is 0.99
quantile(x, 0.99)   # estimate from the full sample
mean(sapply(1:2000, function(i) {
  i <- i * 50
  quantile(x[(i - 49):i], 0.99)  # note the parentheses: i-49:i would parse as i - (49:i)
}))
When run multiple times, I've found that the mean is always lower than the quantile over the full sample.
Actually, doh, you're right, my bad, turns out sample quantiles are only asymptotically unbiased, and a better estimator would take some sort of weighted average of P99 and P99.5 with the weights depending on the sample size. (You still don't need the full population and you don't have to merge histograms, though.)
I find this part of statistics very hard to argue about. Simulations only work with well-known distributions, and even then not all of them. Even when you have a simulation that confirms hypothesis A1 using distribution D1, it says little about some pathological distribution D2.
Using mathematical precision might take years to get it right and often there is no easy answer.
What I've been looking for is something that is fast and works "good enough". Estimating the 95th percentile and averaging it did not work well in my use case. Using the histogram method does work well although it certainly is not perfect.
Thanks, this is informative and pretty similar to what I've been implementing in JavaScript. I guess I'll have to take a deeper look into literature when doing an update to that topic.
Gave the post an upvote because it is interesting from a theoretical perspective, but I have a hard time imagining a real-life scenario where averaging a 99th percentile will lead you to the wrong conclusion.
Perhaps I'm wrong, but whenever I'm looking at the tail of a distribution, it's usually just to understand the order of magnitude, not to reach a precise number.
The problem with percentiles is that you are discarding data points: measuring the 98th percentile, for example, involves discarding the top 2% of the data.
The problem with that is, sometimes the top 2% you are discarding might correlate to your top 2% of customers...and you are literally throwing their data away by using percentiles. Not good.
My recommendation is to pick two aggregation types. Maybe Percentile and Maximum, or Maximum and Mean, or Percentile and Mean. You can't really go wrong with that approach.
Does anyone know the math behind exactly how wrong the averaged percentiles are? My dim understanding of stats makes me think the central limit theorem is at play here; the averaged p99 values will tend towards a normal distribution, which is obviously wrong. Would love to be schooled on it.
I think the lesson is: don't just blindly calculate some numbers/metrics; have a look at your data (visually!) and see what makes sense (for instance, whether to use the 99th / 95th / 90th percentile).
A day late, but I wanted to add: at New Relic (where I work) we ended up just deciding to store all the data points. We literally store every transaction and page view for our customers and then go back and read every data point when they are queried.
It depends who's reading it. The whole genre of titles "* you don't know", "you're wrong about *" is condescending and frequently false.
I suspect that's what the parent post was referring to and I would agree: percentile calculations do work the way I think, and the article didn't say anything controversial.
I'm the article's author. I don't think the headline HN chose is great but I don't mind if people have different interpretations of what I write. Hopefully all the commenters here are actually reading the original article and judging it on its merits (and flaws), right? right?