The Central Limit Theorem and Sampling (luminousmen.com)
52 points by luminousmen on July 7, 2019 | 8 comments



> According to the central limit theorem, the average value of the data sample will be closer to the average value of the whole population and will be approximately normal, as the sample size increases.

This is a pretty loose statement of the central limit theorem. Sample averages converge to the population average by the strong law of large numbers (almost surely, under mild conditions). The central limit theorem is a statement about the difference between the sample average and the population mean: multiply that difference (sample avg. - pop. mean) by sqrt(sample size), and as the sample size goes to infinity the rescaled difference converges in distribution to a Normal (under somewhat stronger conditions, e.g. finite variance).
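
A quick simulation makes the distinction concrete (a minimal sketch using numpy; the exponential distribution and the sample sizes are arbitrary choices, not anything from the article):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, n, trials = 1.0, 10_000, 1_000  # exponential(1): mean 1, variance 1

    # Law of large numbers: a single large sample's mean is close to mu.
    print(rng.exponential(mu, size=n).mean())  # ~1.0

    # CLT: sqrt(n) * (sample mean - mu), across many trials, looks Normal(0, 1).
    means = rng.exponential(mu, size=(trials, n)).mean(axis=1)
    z = np.sqrt(n) * (means - mu)
    print(z.mean(), z.std())  # ~0.0 and ~1.0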


This is 1st year undergraduate statistics and has nothing to do with data science. The graphics are cute though. I'm sorry for being so blunt.


> this is true regardless of the distribution of population

Nope. Not true for all distributions. For example, stock prices cannot be used with the central limit theorem.

Distributions must have a finite second moment, in other words a finite variance.


Stock prices surely have finite variance... but it’s true that the variance can be high enough to be problematic, and if you choose to model prices with an infinite-variance distribution you need to pay attention to the consequences.


> Stock prices surely have finite variance.

Are you sure? On my data, they are most closely modeled by a Pareto distribution (which has infinite variance when its tail index is at most 2).

As far as I remember, that was discovered by Mandelbrot, studying cotton prices.
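
One way to see the practical consequence (a sketch with numpy; the tail index 1.5 is an arbitrary heavy-tailed choice): the sample variance of Pareto data never settles down as the sample grows, unlike for a finite-variance distribution.

    import numpy as np

    rng = np.random.default_rng(1)
    alpha = 1.5  # tail index <= 2 => infinite variance

    # numpy's pareto() draws from the Lomax (Pareto II) form; adding 1
    # gives the classical Pareto with minimum value 1.
    for n in (10**3, 10**5, 10**7):
        x = 1.0 + rng.pareto(alpha, size=n)
        print(n, x.var())  # keeps growing instead of converging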


Do you really think there is a non-zero probability of the price of a stock, say Apple, closing today over $1,000? Over $1mn? Over $1bn? Over $184,467,440,737,095,516.15, the largest number of cents that can be stored as an unsigned 64-bit integer? Over 10^100 dollars?
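
A one-line version of the argument: any hard cap M on the price forces finite variance, since 0 <= X <= M implies E[X^2] <= M^2 and hence Var(X) <= M^2. An infinite-variance model of prices can only be an approximation of the fat tail, not a literal description.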


It is easy to find this material. This link adds nothing to the vast number of tutorials on probability.


They give sampling examples as if I, as a data scientist, would be sampling an actual human population.

Give me real examples of data sampling in the wild: how did you obtain such-and-such dataset? How did you clean up data x? How did you infer that the API provider’s server was misconfiguring parameter X, and that therefore 10% of our cash flow was attributed to Wednesday last week instead of today?

This feels more like wikiHow.



