It Takes Long to Become Gaussian (two-wrongs.com)
88 points by kqr on July 20, 2023 | 36 comments



Very important article. Software engineering also has a lot of fat-tailed distributions that can masquerade as normal distributions. Our intuition tends to assume normal distributions to one degree or another and we must learn when to disregard it.

One example that is both very important to software engineers and will hit close to home is effort estimations. It is a super fat-tailed distribution. The problem is that a lot of tools managers will use to try to make schedules implicitly assume that errors will essentially be Gaussian. I can draw this fancy Gantt chart because OK, sure, task 26 may be a couple of days late, but maybe task 24 will be a couple of days early. As we all know from real life, tasks are effectively never early, and it's completely expected that at least one task from any non-trivial project will be budgeted for 2 weeks in the early planning stage but actually take 4 months. Managers are driven batty by this because intuitively this feels like a 7 or 8 sigma event... but in real life, it isn't, it's so predictable that I just predicted it in your project, sight unseen, even though I know nothing about your project other than the fact it is "nontrivial".

But there are a ton of places in software engineering, especially at any kind of scale, where you must learn to discard Gaussian assumptions and deal with fatter-tailed distributions. I get a lot of mileage out of the ideas in this essay. It's hard to give a really concrete guide to how to do that, but it's a very valuable skill if you can figure out how to learn it.


It might sound silly when spoken out loud, but I found the following example quite enlightening about this phenomenon: imagine that you want to drive to the airport, and you need to estimate how long it will take you to get there. You can simply estimate how long the drive will take by dividing the distance by your average speed.

If you actually drove there many times and measured your trip length, most of the time you would arrive later than that estimate, sometimes horribly so.

Why is that? The explanation is actually trivial: no matter how well things work, there is an absolute minimum time required to get to the airport. On the other hand, there are many things that might delay your trip: heavy traffic, closed roads, accidents...

So, on the aggregate, there are very few things that can "go well", and they will reduce your trip time only a bit. On the other hand there are many things that can go wrong and make your trip last much, much longer.

That's a sort of intuitive explanation of why things like software estimates are fat-tailed instead of normal.
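
A minimal sketch of that asymmetry (all numbers below are made up for illustration): a hard floor on the trip time, a small possible saving, and a few rare but potentially large delay sources.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    base = 30.0                                    # hard minimum trip time, in minutes
    savings = rng.uniform(0, 3, size=n)            # best case only shaves off a few minutes
    # rare but potentially large delays (probability of occurring x typical size)
    traffic  = (rng.random(n) < 0.30) * rng.exponential(5, size=n)
    roadwork = (rng.random(n) < 0.05) * rng.exponential(15, size=n)
    accident = (rng.random(n) < 0.01) * rng.exponential(40, size=n)

    trip = base - savings + traffic + roadwork + accident
    print(np.median(trip), np.mean(trip), np.percentile(trip, 99))

On a run like this the median sits just above the floor, while the mean and especially the 99th percentile are dragged upward by the rare long delays: few things can go well, and only a little; many things can go wrong, and a lot.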


>Why is that? The explanation is actually trivial: no matter how well things work, there is an absolute minimum time required to get to the airport. On the other hand, there are many things that might delay your trip: heavy traffic, closed roads, accidents...

Perhaps an easier explanation is that while driving the route you will encounter several Poisson processes, and the waiting time of a Poisson process is exponentially distributed.
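
A quick numerical companion to that point (the rates are arbitrary): if the trip crosses a few independent Poisson "obstacles", each contributes an exponentially distributed wait, and their sum is right-skewed rather than Gaussian.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000

    # three hypothetical Poisson obstacles with mean waits of 2, 3 and 5 minutes
    waits = rng.exponential(scale=[2.0, 3.0, 5.0], size=(n, 3)).sum(axis=1)

    print(np.median(waits), np.mean(waits), np.percentile(waits, 99))

The median falls below the mean and the 99th percentile stretches far to the right, matching the "almost always fine, occasionally terrible" experience.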


Ironically, that anecdote also paints a good reason to sample the data on how long it takes to get there. If you check your own history for how long it takes to get somewhere, it is probably far more predictable than you'd expect.

Of course, this also explains why software estimates are hard. We are all too often estimating things we haven't done before.


> The problem is that a lot of tools managers will use to try to make schedules implicitly assume that errors will essentially be Gaussian.

No, the fundamental problem is that someone who provides a genuinely accurate schedule always loses against someone who provides a politically palatable schedule.

Until you fix this, scheduling will always be inaccurate.


That is a true problem, but an orthogonal one to what I am talking about. Gantt charts subtly, but deeply, work on the assumption that task failures will be Gaussian.

There are many problems with the whole "I can schedule the world months in advance" approach in the software world, and even these two aren't all of them. There's a reason it isn't just sort of a failure but such a comprehensive failure, across decades and across all scales of projects; or rather, there are a lot of reasons.


That someone is perhaps not affected by a missed schedule as badly as others, so it is rational for them to provide an overly optimistic schedule and, if something goes amiss, find a political solution. One such solution is to find a scapegoat.


I have been working in custom steel fabrication for most of my career, and our distributions are all fat-tail distributions. The main trick we've learned is to focus on and obsess over outliers at every stage of moving through our shop. If something has been sitting for more than three days, it's brought up in every meeting until it's moving again.

The hardest part to communicate to accountant types and management types (especially if they come to our company from a manufacturing background) is that our system is and always will be a chaotic system, and therefore inherently unpredictable.

The reason we are a chaotic system is because the base element in our system is a person, and people have inherently chaotic productivity output.

Another major contributor to the chaos is the Pareto distribution of productivity among workers. When scheduling plans a task to take 10 hours, what they can't account for is that if employee A does it, it will take 4 hours, and if employee B does it, it will take 25 hours.

I could go on and on with other layers of complexity that create long fat-tail distributions for us, but you get the point.
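
A toy version of that scheduling problem (the 4/10/25-hour figures come from the comment above; the mix proportions are invented): the plan describes the typical worker, but the realized duration depends on who picks the job up.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000

    # hypothetical split: 30 % fast (~4 h), 50 % average (~10 h), 20 % slow (~25 h)
    typical_hours = rng.choice([4.0, 10.0, 25.0], p=[0.3, 0.5, 0.2], size=n)
    hours = typical_hours * rng.lognormal(mean=0.0, sigma=0.3, size=n)  # day-to-day variation

    print(np.median(hours), np.mean(hours), np.percentile(hours, 95))

The scheduled 10 hours lands near the median, but the mean and the 95th percentile are pulled well above it by the slow assignments, so a plan built on the "typical" figure is routinely blown.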


Actually, the reason the real world isn't always Gaussian isn't only that sums take a long time to become Gaussian. Another factor is that the Gaussian isn't the only distribution that's "stable" under summation.

Yes, a sum of random variables that satisfies the central limit theorem will be Gaussian, but this requires that the random variables have finite mean and variance, and there are plenty of random variables/distributions that don't satisfy this. In general, the L-stable (or stable) family of distributions describes the general case here.

Notably, Benoit Mandelbrot's study of fractals began with the study of non-Gaussian distributions in financial markets and other situations. For example, financial markets are a sum of price jumps, but those jumps aren't of similar size; rather, a few large jumps can have a bigger impact than all the small ones together.

See: https://en.wikipedia.org/wiki/Stable_distribution
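
A small sketch of that stability using scipy's levy_stable (the alpha value is arbitrary): sums of an alpha = 1.5 stable variable, rescaled by n^(1/alpha) rather than sqrt(n), keep their heavy tails no matter how many terms are added.

    import numpy as np
    from scipy import stats

    alpha = 1.5                       # tail index < 2 means infinite variance
    n_terms, n_sums = 500, 5000
    x = stats.levy_stable.rvs(alpha, 0.0, size=(n_sums, n_terms), random_state=42)

    # stable scaling: divide by n**(1/alpha) instead of sqrt(n)
    sums = x.sum(axis=1) / n_terms ** (1 / alpha)

    # tail mass beyond 5 scale units: essentially zero for a Gaussian, not here
    print(np.mean(np.abs(sums) > 5), 2 * stats.norm.sf(5))

The scaled sums carry about the same tail weight as a single draw, while a Gaussian would put essentially no mass out there; that is the sense in which the family is closed under addition.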


While all true, I think variables with infinite variance are hard to come by. For a start, almost everything in the real world is kind of bounded (markets can't go below zero and also can't jump by more than a certain amount per day, the laws of physics determine the minimum and maximum weight of people, etc.).

But it's true that CLT requires some form of IID distribution, or something slightly weaker, and that often doesn't apply.


> markets can't go below zero and also can't jump by more than a certain amount per day either

It depends on what you are looking at, and how you are modelling it, doesn't it?

For example: the value of log(asset_value_today/asset_value_yesterday) can become unbounded (and, thus, so can all its moments).

Furthermore, the problem is not just the "finite variance assumption", but the fact that the CLT only talks about what happens asymptotically: for some reason people use t-scores in practice, rather than z-scores. The CLT does not imply normality of (e.g.) sum-based estimators if you have a finite number of samples (which is pretty much always the case).
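
A quick illustration of the t-score point (the threshold and degrees of freedom are picked arbitrarily): the probability of landing more than three "standard errors" out under Student's t with few degrees of freedom versus under the normal.

    from scipy import stats

    for df in (3, 10, 30, 100):
        print(df, stats.t.sf(3, df), stats.norm.sf(3))

With 3 degrees of freedom the tail probability is roughly twenty times the normal one; only at large sample sizes do the two agree, which is the practical content of "the CLT is only asymptotic".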


I'd go farther and say the real central limit theorem is the generalized central limit theorem, and the finite variance central limit theorem is just a quirky special case.


Oh boy, wish I had read this 6 years ago. I've had little stats training, and worked a bit in a research hospital. People would NEVER plot distributions, just give the mean. Frequently I had to beg for error bars. Eventually I started voicing my belief that everybody should be validating their distributions (which were ALWAYS assumed to be normal), because we HAVE the data, and in an earlier life I had been working with log-normal distributions for which the mean was a useless measure. I couldn't convince anyone, probably because I never could state it this clearly or support it this well with stats knowledge. That incredible creative poverty was one of the reasons I left.


I took a lot of probability and statistics in college (I was a Math major); I tend to assume that very few people using statistics understand what they are doing well enough to avoid serious mistakes. True story: for one class project, I was directed to find a published paper, acquire the same data set, and verify the results. I contacted the author and got his actual data, ran the model (ordinary least squares) and verified the result; then I checked for heteroskedasticity, found it, corrected for it, and determined that the correction made all of his results go away. His model was completely inconclusive. (The author was a university professor in economics; the paper was based on his doctoral dissertation.)


This is a feedback problem.

Most users of statistics have their results validated not against reality, but in the eyes of other users of statistics. This means if the process you follow sort of looks right, the result will be accepted as true, and you will never learn otherwise.

Finance people have a much more nuanced understanding of statistics, because their use of statistics results in red or black numbers at the end of the year.


Yup, this is depressingly normal (heh). Stats is hard, and people don't interrogate their results when they like them (unfortunately).


> So here’s my radical proposal: instead of mindlessly fitting a theoretical distribution onto your data, use the real data. We have computers now, and they are really good at performing the same operation over and over – exploit that. Resample from the data you have!

Hmm.

The problem with fat-tailed distributions is that often the real data doesn't actually show you how fat-tailed the distribution is. The author notes this later, but doesn't discuss the difficulty it creates for the suggestion of "using the real data":

> In other words, a heavy-tailed distribution will generate a lot of central values and look like a perfect bell curve for a long time … until it doesn’t anymore. Annoyingly, the worse the tail events, the longer it will keep looking like a friendly bell curve.

This is all true, but the author misses the immediate corollary: the worse the tail events, the longer using the real data will mislead you into thinking the tails are fine. Using the real data doesn't solve the fat tails problem; in fact, resampling from the real data is sampling from a bounded distribution with zero tails.
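
A sketch of that corollary (the tail index and sample size are arbitrary): resampling a heavy-tailed sample cannot produce anything larger than the largest value already observed, so estimates of extreme quantiles come out capped.

    import numpy as np

    rng = np.random.default_rng(4)

    # "true" world: Pareto with tail index 1.5 (infinite variance), minimum 1
    data = 1 + rng.pareto(1.5, size=1000)

    true_q = (1 / 1e-4) ** (1 / 1.5)          # true 99.99th percentile, roughly 464
    boot_q = [np.percentile(rng.choice(data, size=data.size, replace=True), 99.99)
              for _ in range(2000)]

    print(true_q, np.mean(boot_q), data.max())

On a typical run the bootstrap estimates cluster just below the sample maximum, far short of the true quantile; the resample can only recycle values that were already observed.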


From Statistical Consequences of Fat Tails (N. Taleb):

"We will address ad nauseam the central limit theorem but here is the initial intuition. It states that n-summed independent random variables with finite second moment end up looking like a Gaussian distribution. Nice story, but how fast? Power laws on paper need an infinity of such summands, meaning they never really reach the Gaussian. Chapter 7 deals with the limiting distributions and answers the central question: "how fast?" both for CLT and LLN. How fast is a big deal because in the real world we have something different from n equals infinity."

https://arxiv.org/abs/2001.10488


CLT doesn't say that distributions become Gaussian, it says that if you take sets of samples from a distribution, the distribution of means of those sets of samples is often Gaussian (under certain assumptions) even if the underlying distribution is not Gaussian. It's incredible how often people get this wrong.

The distribution of means of samples is not always Gaussian (eg CLT doesn't hold if the underlying is a Cauchy distribution for example) and the underlying distribution won't magically "become Gaussian" even if you wait a long time unless it was Gaussian to start with.
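
A quick check of the Cauchy remark: the spread of the sample means does not shrink as the sample size grows, because the mean of n Cauchy draws is itself standard Cauchy.

    import numpy as np

    rng = np.random.default_rng(5)

    for n in (10, 1_000, 100_000):
        means = rng.standard_cauchy(size=(200, n)).mean(axis=1)
        q25, q75 = np.percentile(means, [25, 75])
        print(n, q75 - q25)   # interquartile range stays near 2 instead of shrinking like 1/sqrt(n)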


Your first paragraph is how I had understood the CLT as well. I’m wondering now if the author of the post knows more than me and is correct about what he’s talking about, or if he’s wrong.


If the mean is Gaussian then so is the sum – one is simply the other scaled by a non-random amount.

More formally, if the sample mean satisfies x̄ ~ N(m, s/sqrt(n)), then the sum n·x̄ ~ N(nm, sqrt(n)·s).


A linear combination of Gaussians is Gaussian, but that's not what the CLT is talking about. It is saying that the distribution of _sample means_ taken from some distribution is Gaussian. The distribution of means of samples talked about by the CLT is quite different from the underlying distribution. Its variance is \sigma^2/n if the original distribution has variance \sigma^2, for example. It does have the same mean as the underlying distribution, however.

Edit to add: Here's a small python program so you can run it in a jupyter notebook or something and see for yourself:

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # broken 6-sided die: side 4 has probability 0, the other five sides 0.2 each
    broken_dice_probabilities = [int(x != 3) * 0.2 for x in range(6)]
    broken_dice_sides = [x + 1 for x in range(6)]

    n = 1000000
    population = np.random.choice(broken_dice_sides, replace=True, p=broken_dice_probabilities, size=n)
    ax = sns.histplot(population)
    ax.set_title("Underlying population")
    plt.show()

    # draw many small samples from that population and record each sample's mean
    n_samples = 100000
    sample_size = 20
    clt_samples = np.zeros(n_samples)
    for i in range(n_samples):
        this_sample = np.random.choice(population, replace=True, size=sample_size)
        clt_samples[i] = np.mean(this_sample)

    ax = sns.histplot(clt_samples)
    ax.set_title("Sample means")
    plt.show()
I make a very obviously non-Gaussian distribution - a broken 6-sided die that cannot roll a four. You can see the underlying distribution is uniform except for one number where p=0; then I take samples from this distribution and plot their means. Seeing is believing: the distribution of these means is Gaussian. Now, note that you can keep sampling from the underlying distribution forever - the results will never be Gaussian.


If the distribution of sample means tends to Gaussian, then the distribution of sample sums tends to the same Gaussian (except scaled).

So if the original X_i are drawn from any distribution with well-defined mean m and variance s², then the CLT says that, in the limit, the sample mean is distributed N(m, s/sqrt(n)) and, equivalently, the sample sum is distributed N(nm, sqrt(n)·s).

The question is when the "in the limit" result starts being practically useful, and that depends on the distribution of the original X_i.
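
A tiny numerical check of that scaling, with exponential draws standing in for an arbitrary non-Gaussian X_i:

    import numpy as np

    rng = np.random.default_rng(6)
    n, reps = 100, 50_000

    x = rng.exponential(scale=2.0, size=(reps, n))   # mean m = 2, standard deviation s = 2

    print(x.mean(axis=1).std(), 2.0 / np.sqrt(n))    # sample mean: spread ~ s / sqrt(n)
    print(x.sum(axis=1).std(),  2.0 * np.sqrt(n))    # sample sum:  spread ~ s * sqrt(n)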


Yes. Sorry I misunderstood your point.


The author's stock market example fails to talk about percentage returns and makes no sense to me.

"Will a month of S&P 500 ever drop more than $1000?

Using the same procedure but with data from the S&P 500 index, the central limit theorem suggests that a drop of more than $1000, when looking at a time horizon of 30 days, happens about 0.9 % of the time."


Using the (log) returns would certainly be the usual way to approach this.
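
For concreteness, with made-up index levels, the usual quantity would be the day-over-day log return rather than the dollar change:

    import numpy as np

    # hypothetical daily closing levels
    prices = np.array([4400.0, 4410.5, 4385.2, 4402.9, 4390.1])
    log_returns = np.diff(np.log(prices))
    print(log_returns)

A fixed dollar threshold like "$1000" means something very different at an index level of 1000 than at 4500, which is why returns (or log returns) are the standard unit for this kind of question.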


> So here’s my radical proposal: instead of mindlessly fitting a theoretical distribution onto your data, use the real data. We have computers now, and they are really good at performing the same operation over and over – exploit that. Resample from the data you have!

I'm just getting into this for the PhD work:

"Adèr et al. recommend the bootstrap procedure for the following situations:[0]

        When the theoretical distribution of a statistic of interest is complicated or unknown. Since the bootstrapping procedure is distribution-independent it provides an indirect method to assess the properties of the distribution underlying the sample and the parameters of interest that are derived from this distribution."
[0] https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
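
A minimal sketch of the procedure with plain numpy, using lognormal draws as a stand-in for real observations: resample with replacement, recompute the statistic each time, and read a confidence interval off the resulting distribution.

    import numpy as np

    rng = np.random.default_rng(7)
    data = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # placeholder for real data

    boot_medians = np.array([
        np.median(rng.choice(data, size=data.size, replace=True))
        for _ in range(10_000)
    ])

    ci_95 = np.percentile(boot_medians, [2.5, 97.5])      # percentile-method 95 % interval
    print(np.median(data), ci_95)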


Nonparametric Statistics to the rescue!


This, along with the conclusion, is wrong in the first example:

  N(μ=30×0.2, σ=√30×0.9).

  With these parameters, 16 MB should have a z score of about 2, and my computer (using slightly more precise numbers) helpfully tells me it's 2.38, which under the normal distribution corresponds to a probability of 0.8 %. 

  [...]

  The last point is the most important one for the question we set out with. The actual risk of the files not fitting is 6× higher than the central limit theorem would have us believe. 
The probability implied by the normal distribution is 2.1%, not 0.8%:

  > 1 - pnorm(16, 30 * 0.2, sqrt(30) * 0.9)
  [1] 0.02124942
So the actual probability of the files not fitting is only 2.26 times higher than the implied probability. I think this is actually quite impressive. At n = 30, we are talking about a very small sample, one for which we wouldn't usually expect the central limit theorem to do well.


To be clear, the recommendation of N=30 came from the LessWrong article the author is critiquing, not the author themselves.


And be especially careful if your distribution is bounded and you are somewhat near the bound.... (Gammas come in handy)
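
A small example of why, with made-up positive data bounded below by zero: a normal fit leaks probability into impossible negative values, while a gamma fit respects the bound.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    data = rng.gamma(shape=2.0, scale=1.5, size=500)    # positive quantity, bounded at 0

    mu, sigma = stats.norm.fit(data)
    print("mass below 0 under the normal fit:", stats.norm.cdf(0, mu, sigma))

    shape, loc, scale = stats.gamma.fit(data, floc=0)   # keep the lower bound fixed at 0
    print("fitted gamma shape and scale:", shape, scale)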


Yep - whenever possible, use Poissonian processes.

Have you seen an image with real (non-simulated) noise denoised by a Wiener filter? It depends critically on that noise being Gaussian - an extremely common assumption.

I guess I don't need to tell you that, to put it politely, it doesn't work that well…


What leads us astray is assuming that things are similar just because they have the same name. There's not much similarity between a short Markdown file and a full-length movie file. A glass of water and the Pacific Ocean are both bodies of water.



I, a dumb person, simulated sequences of draws and convergence seems way slower than the first poster indicates.

    # R lang
    samples <- 1000
    sample_len <- 10000

    # each column is the running (cumulative) sum of uniform draws
    x <- sapply(1:samples, function(i) cumsum(runif(sample_len)))

    # distribution of the sum after sample_len draws, across the 1000 replicates
    check_sample_at_len <- sample_len
    plot(density(x[check_sample_at_len, ]))


Great article!



