My favorite references on statistics get to an idea that I think you will be inclined to emphasize: DATA is what matters in statistics, more than mathematical manipulation. See what you think about these, and best wishes for further revisions of your interesting online guide.
http://statland.org/MAAFIXED.PDF
The first link is a BEAUTIFUL and thought-provoking discussion of what's dangerous about having a new mathematics professor choose the statistics textbooks for an introductory college class in statistics, with advice on what to look for in statistics textbooks.
http://escholarship.org/uc/item/6hb3k0nz
The second link is by a very famous statistician, with discussion of how the statistics curriculum could be revised to better emphasize the most important ideas.
Put the key ideas from these two resources in your own words, and you will have a good guide for programmers about how to think about statistics.
Very interesting. I certainly agree with the idea from the first article, on the importance of data, and from the second article, on the usefulness of simulation.
I bought the textbook just over three years ago after I read that. It is quite thought-provoking, not just another Tweedledum to the usual Tweedledee of undergraduate statistics textbooks.
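To make the simulation point concrete, here is a minimal sketch (my own toy example, not taken from either article or from the textbook): a permutation test that estimates a p-value purely by resampling. The measurements, group sizes, and number of shuffles are all made-up assumptions.

    # Minimal permutation test: estimate a p-value by simulation alone.
    # The data and the iteration count are illustrative assumptions.
    import random

    group_a = [12.1, 9.8, 11.4, 10.9, 12.6, 11.1]   # hypothetical measurements
    group_b = [10.2, 9.5, 10.8, 9.9, 10.4, 10.1]

    def mean(xs):
        return sum(xs) / len(xs)

    observed_diff = mean(group_a) - mean(group_b)

    pooled = group_a + group_b
    n_iter = 10000
    extreme = 0
    for _ in range(n_iter):
        random.shuffle(pooled)
        diff = mean(pooled[:len(group_a)]) - mean(pooled[len(group_a):])
        if abs(diff) >= abs(observed_diff):
            extreme += 1

    # Fraction of shuffles at least as extreme as the observed difference,
    # i.e. a simulated p-value with no distributional formulas involved.
    print(extreme / n_iter)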
If you've gone through this book and want to learn more ways to use statistics and probability, you should look at the videos from mathematicalmonk[1] on YouTube - he's like Khan Academy for machine learning and more rigorous probability. They helped me learn machine learning and hidden Markov models pretty well, at least in my experience.
Very much so. I'm still trying to find out exactly what maths will be required; I'm looking at material from related courses at http://see.stanford.edu/see/courses.aspx.
That's partly why I bought it. My university didn't have a statistics department, so I never had the chance to take a statistics course that was worthwhile.
Statistics and linear algebra really should be required by all CS programs. It's funny that at many schools those courses are not, yet Calculus is. First, Calculus should have been handled in HS. Second, I've never had a use for Calculus professionally or for anything I've worked on in my free time.
Calculus is a major prerequisite for both a reasonably serious statistics course and a reasonably serious AI/ML course. Furthermore, at least for ML it is multivariable calculus built on top of linear algebra; I doubt you could learn that in high school at the level needed.
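As a rough sketch of what that combination looks like in practice (my own example, with made-up data and an assumed step size), here is linear regression fitted by gradient descent; the gradient of the squared-error loss is a multivariable-calculus result written entirely in linear-algebra notation.

    # Fitting a linear model by gradient descent. The gradient of the mean
    # squared error, (2/n) * X^T (X w - y), is where multivariable calculus
    # and linear algebra meet. Data, step size, and iteration count are assumed.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))            # 100 samples, 3 features
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=100)

    w = np.zeros(3)
    step = 0.1
    for _ in range(500):
        grad = (2 / len(y)) * X.T @ (X @ w - y)   # gradient of the loss
        w -= step * grad

    print(w)   # should land close to true_w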
I've read O'Reilly's Collective Intelligence. It's a great introductory survey, but it was very light on theory.
I also own Collective Intelligence in Action. It had more explanation of theory than O'Reilly's offering, but most of the chapters devolved into how to use Java data mining framework X.
Furthermore, another reason the entire classic test-based approach is bad is that it encourages binary hypotheses when many real-world problems don't fit that mold: the hypotheses may be numerous, composite, or even infinite in number. (Of course, many real-world problems can be reduced to binary ones, which is one reason the approach became popular.)
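For instance (a toy example of my own, not from the thread), with more than two candidate hypotheses you can simply compute a posterior over all of them instead of forcing a binary accept/reject decision. The candidate coin biases and the flip data below are made up.

    # Posterior probabilities over several candidate coin biases, rather than
    # a single binary "fair vs. not fair" test. Hypotheses and data are assumed.
    from math import comb

    candidate_p = [0.3, 0.4, 0.5, 0.6, 0.7]       # candidate hypotheses
    prior = [1.0 / len(candidate_p)] * len(candidate_p)

    heads, flips = 14, 20                          # made-up data

    likelihood = [comb(flips, heads) * p**heads * (1 - p)**(flips - heads)
                  for p in candidate_p]
    unnorm = [pr * lik for pr, lik in zip(prior, likelihood)]
    posterior = [u / sum(unnorm) for u in unnorm]

    for p, post in zip(candidate_p, posterior):
        print(f"p = {p:.1f}: posterior = {post:.3f}")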
If you know enough probability theory, statistics is just a special case. The nice thing about using probability theory is that if you do decide to use a 'test', all of your assumptions are put forth first. As E.T. Jaynes says:
In estimating a location parameter, for example, the sample median M is often cited as a more robust estimator than the sample mean. But here it is obvious that this ‘robustness’ is bought at the price of insensitivity to much of the relevant information in the data. Many different data sets all have the same median; the values above or below the sample median may be moved about arbitrarily without affecting the estimate. Yet those data values surely contain information highly relevant to the question being asked, and all this is lost. We would have thought that the whole purpose of data analysis is to extract all the information we can from the data.
Thus, while we agree that robust/resistant properties may be desirable in some cases, we think it important to emphasize their cost in performance. In the literature, ad hoc procedures have been advocated on no more grounds than that they are ‘robust’ or ‘resistant’, with no mention of the quality of the inference they deliver, much less any comparison of performance with alternative methods; yet alternative methods such as Bayesian ones are criticized on grounds of lack of robustness, without any supporting factual evidence.
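A quick numerical illustration of Jaynes's point (toy numbers of my own choosing): the values above the sample median can be moved about arbitrarily without changing the median at all, while the mean responds to that information.

    # The median ignores how far the upper values sit; the mean does not.
    import statistics

    data  = [1.0, 2.0, 3.0, 4.0, 5.0]
    moved = [1.0, 2.0, 3.0, 40.0, 500.0]   # values above the median moved arbitrarily

    print(statistics.median(data), statistics.median(moved))   # 3.0 and 3.0
    print(statistics.mean(data), statistics.mean(moved))       # 3.0 vs 109.2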
Sorry! It's really not meant to be a trick. The book is under a CC license. I provide the PDF version at Green Tea Press, the LaTeX source for anyone who wants to make a modified version, and a not very good HTML version (some of the math is broken). O'Reilly provides the printed and Kindle versions.
Right now all versions have the same content, but I will continue to revise, so you can think of the version on Green Tea Press as the draft of the second edition.
I know it can be confusing, but I hope the benefits of the free license make up for it.
Green Tea Press are good guys; I would never feel sorry about paying them. They have done far more for humanity than my $30 ever will: http://greenteapress.com/
I'm a little confused about the various editions. Just a few weeks ago I bought the "Think Stats" ebook published by O'Reilly (same author). Is it all the same content?