As I read the first five items on that list, each one made me think more and more "then isn't the answer to not write distributed systems?". But then the last item gives almost the opposite advice: "Extract services".
If you don't need (micro)services, don't build them -- they only introduce complexity in communication between components, in build tooling, and in diagnosing issues, and they slow things down. It's amazing what you can do on a single machine, or, beyond that scale, with a simple two-tier architecture: a few web nodes and a big SQL database. Systems like that are much cheaper, more predictable, and easier to understand. Of course, if you're at Google or Facebook scale, distributed systems are (a) necessary (evil).
Additionally, I'm not sure his last point "extract services" is a good rule of thumb. If something can be a library, it'll be much simpler as a library. Don't make an "email service" when you can just import an SMTP library, wrap it in a couple dozen lines of code, and be done with it.
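To make that concrete, here is roughly the shape of the library version, sketched in Python with the standard-library smtplib (the relay host, port, and credentials below are placeholders; a real version would want error handling and probably a retry):

    import smtplib
    from email.message import EmailMessage

    # Placeholder settings -- swap in your own relay and credentials.
    SMTP_HOST = "smtp.example.com"
    SMTP_PORT = 587
    SMTP_USER = "app@example.com"
    SMTP_PASS = "secret"

    def send_email(to_addr, subject, body):
        """Send a plain-text email through a single SMTP relay."""
        msg = EmailMessage()
        msg["From"] = SMTP_USER
        msg["To"] = to_addr
        msg["Subject"] = subject
        msg.set_content(body)

        with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
            server.starttls()                 # upgrade to TLS before authenticating
            server.login(SMTP_USER, SMTP_PASS)
            server.send_message(msg)

Call send_email(...) from wherever you need it: no network hop, no deploy pipeline, no second service to monitor.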
"Using a mean assumes that the metric under evaluation follows a bell curve but, in practice, this describes very few metrics an engineer cares about. “Average latency” is a commonly reported metric, but I’ve never once seen a distributed system whose latency followed a bell curve. If the metric doesn’t follow a bell curve, the average is meaningless and leads to incorrect decisions and understanding. "
I was never great at stats. Can somebody explain why this is?
Imagine a hospital where the only patients are either mothers, or babies. All the babies are in the age range zero to 2 months, and the mothers are in the range 16 to 40 years old (mostly). If you compute the average age, it comes out at 12 years old.
What has the average told us? It has told us nothing because it's meaningless with a bi-modal distribution.
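To make it concrete, a throwaway sketch in Python (the mix of patients is made up, so the exact average differs from the 12 above, but the effect is the same):

    from statistics import mean, median

    # Made-up ward: 50 babies around 5 weeks old and 50 mothers aged 16-40.
    babies  = [0.1] * 50                 # ages in years
    mothers = list(range(16, 41)) * 2    # 50 mothers spread over 16-40

    ages = babies + mothers
    print(mean(ages))    # ~14 years: an age nobody in the ward actually is
    print(median(ages))  # ~8 years: also lands in the gap between the two groups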
For more info read up on the normal (Gaussian) distribution, bi-modal distributions, and the central limit theorem.
I wouldn't say that I am great at stats either, but AFAIK this is mostly because of very high tail latencies (usually the 99th percentile, commonly caused by garbage collection or cache misses) or because the latencies have a multimodal distribution [1] (e.g. where the monitored system has a request fast path and a slow path, so the latencies "group" around multiple points).
The average latency does not tell us much since it does not really represent anything meaningful - i.e. it is not the latency of a typical user/request, as you might expect, because the average is skewed by the extremely high tail latencies [2] or by the multiple modes (the peaks in the latency histogram). The typical latency would be better represented by the median latency (though not in the multimodal case, AFAIK).
As for why not just go with the median latency: you usually need to make multiple requests in parallel and end up waiting for the slowest request in the group. The 95th, 98th, or 99th percentile is commonly used to cover for this (sorry, can't find a suitable reference). This is also preferable in the multimodal case (at least for monitoring and general performance diagnosis, since you usually care about the worst/tail cases).
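If it helps, here is a small simulation of the fast-path/slow-path case in Python (the latency numbers are invented, not from any real system):

    import random
    from statistics import mean, median, quantiles

    random.seed(0)

    # Invented latencies: 97% of requests hit a ~20 ms fast path,
    # 3% hit a ~400 ms slow path (think GC pause or cache miss).
    fast = [random.gauss(20, 3) for _ in range(9700)]
    slow = [random.gauss(400, 50) for _ in range(300)]
    latencies = fast + slow

    cuts = quantiles(latencies, n=100)   # cuts[94] ~ p95, cuts[98] ~ p99
    print(f"mean   {mean(latencies):7.1f} ms")   # pulled well above a typical request
    print(f"median {median(latencies):7.1f} ms") # close to the fast path
    print(f"p95    {cuts[94]:7.1f} ms")
    print(f"p99    {cuts[98]:7.1f} ms")          # what the unlucky tail actually sees

The mean lands on a latency that almost no single request experiences, the median hides the slow path entirely, and only the high percentiles describe the tail.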
I think it is because of outliers and because of what distributions can look like if all you know is the mean.
1. A single extreme outlier can move your mean far away from where you would expect it to be, given all the other values.
2. If you see the mean alone, you might think the values cluster around it, but not even a single value has to be anywhere close to the mean. For example: 1 1 1 1 9 9 9 9 → the mean is 5. A pathological example, but you can mix the numbers however you want and there are infinitely many more examples where the mean could mislead you (a quick snippet after this list makes 1. and 2. concrete).
3. The mean is one of the two parameters of a Gaussian (bell-shaped curve, normal distribution); the other is the standard deviation. So if you know the mean, you already have some valuable information about the bell curve: you know where its center is. That is why giving the mean makes sense in that case.
Surely others can put it better or add lots of reasons.
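Here is the promised snippet, just to make 1. and 2. concrete (the numbers are made up):

    from statistics import mean, median

    # 2. No value needs to be anywhere near the mean.
    values = [1, 1, 1, 1, 9, 9, 9, 9]
    print(mean(values))       # 5 -- yet no sample is anywhere close to 5

    # 1. One extreme outlier drags the mean but barely moves the median.
    samples = [10, 11, 9, 12, 10, 11, 9, 10, 11, 5000]   # one huge outlier
    print(mean(samples))      # 509.3 -- nothing like a typical value
    print(median(samples))    # 10.5 -- much closer to what most samples look like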
Think FPS charts in GPU reviews. If the latency distribution is uneven, e.g. you get fast response times but only for a small number of requests, the average won't tell you as much as knowing what the latency is 50%, 90%, or 99% of the time.