The Central Limit Theorem and Sampling (luminousmen.com)
52 points by luminousmen on July 7, 2019 | 8 comments



> According to the central limit theorem, the average value of the data sample will be closer to the average value of the whole population and will be approximately normal, as the sample size increases.

This is a pretty loose statement of the central limit theorem. Sample averages converge to the population average by the strong law of large numbers (almost surely, under mild conditions). The central limit theorem is a statement about the difference between the sample average and the population mean: multiply that difference (sample avg. - pop. mean) by sqrt(sample size), and as the sample size goes to infinity the rescaled difference converges in distribution to a Normal (under somewhat stronger conditions, e.g. finite variance).
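
A quick simulation makes the distinction concrete (a minimal sketch using numpy; the exponential distribution and the sample sizes are arbitrary choices, not anything from the article):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, n, trials = 1.0, 10_000, 1_000  # exponential(1): mean 1, variance 1

    # Law of large numbers: a single large sample's mean is close to mu.
    print(rng.exponential(mu, size=n).mean())  # ~1.0

    # CLT: sqrt(n) * (sample mean - mu), across many trials, looks Normal(0, 1).
    means = rng.exponential(mu, size=(trials, n)).mean(axis=1)
    z = np.sqrt(n) * (means - mu)
    print(z.mean(), z.std())  # ~0.0 and ~1.0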


This is 1st year undergraduate statistics and has nothing to do with data science. The graphics are cute though. I'm sorry for being so blunt.


> this is true regardless of the distribution of population

Nope. Not true for all distributions. For example, stock prices cannot be used with the central limit theorem.

Distributions must have a finite second moment, in other words a finite variance.


Stock prices surely have finite variance... but it’s true that the variance can be high enough to be problematic, and if you choose to model prices with an infinite-variance distribution you need to pay attention to the consequences.


> Stock prices surely have finite variance.

Are you sure? On my data, they are most closely modeled by a Pareto distribution (which has infinite variance when its tail index is at most 2).

As far as I remember, that was discovered by Mandelbrot, studying cotton prices.
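
One way to see the practical consequence (a sketch with numpy; the tail index 1.5 is an arbitrary heavy-tailed choice): the sample variance of Pareto data never settles down as the sample grows, unlike for a finite-variance distribution.

    import numpy as np

    rng = np.random.default_rng(1)
    alpha = 1.5  # tail index <= 2 => infinite variance

    # numpy's pareto() draws from the Lomax (Pareto II) form; adding 1
    # gives the classical Pareto with minimum value 1.
    for n in (10**3, 10**5, 10**7):
        x = 1.0 + rng.pareto(alpha, size=n)
        print(n, x.var())  # keeps growing instead of converging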


Do you really think there is a non-zero probability of the price of a stock, say Apple, closing today over $1,000? Over $1mn? Over $1bn? Over $184,467,440,737,095,516.15, the largest number of cents that can be stored as an unsigned 64-bit integer? Over 10^100 dollars?
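
A one-line version of the argument: any hard cap M on the price forces finite variance, since 0 <= X <= M implies E[X^2] <= M^2 and hence Var(X) <= M^2. An infinite-variance model of prices can only be an approximation of the fat tail, not a literal description.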


It is easy to find this material. This link adds nothing to the vast number of tutorials on probability.


They give sampling examples as if I, as a data scientist, would be sampling an actual human population.

Give me real examples of data sampling in the wild: how did you obtain such-and-such dataset? How did you clean up data x? How did you infer that the API provider’s server was misconfiguring parameter X, and that therefore 10% of our cash flow was attributed to Wednesday last week instead of today?

This feels more like wikiHow.



