My favorite references on statistics get to an idea that I think you will be inclined to emphasize: DATA is what matters in statistics, more than mathematical manipulation. See what you think about these, and best wishes for further revisions of your interesting online guide.
http://statland.org/MAAFIXED.PDF
The first link is a BEAUTIFUL and thought-provoking discussion of what's dangerous about having a new mathematics professor choose the statistics textbooks for an introductory college class in statistics, with advice on what to look for in statistics textbooks.
http://escholarship.org/uc/item/6hb3k0nz
The second link is by a very famous statistician, with discussion of how the statistics curriculum could be revised to better emphasize the most important ideas.
Put the key ideas from these two resources in your own words, and you will have a good guide for programmers about how to think about statistics.
Very interesting. I certainly agree with the idea from the first article, on the importance of data, and from the second article, on the usefulness of simulation.
I bought the textbook just over three years ago after I read that. It is quite thought-provoking, not just another Tweedledum to the usual Tweedledee of undergraduate statistics textbooks.
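To make the simulation point concrete, here is a minimal sketch (my own toy example, not taken from either article or from the textbook): a permutation test that estimates a p-value purely by resampling. The measurements, group sizes, and number of shuffles are all made-up assumptions.

    # Minimal permutation test: estimate a p-value by simulation alone.
    # The data and the iteration count are illustrative assumptions.
    import random

    group_a = [12.1, 9.8, 11.4, 10.9, 12.6, 11.1]   # hypothetical measurements
    group_b = [10.2, 9.5, 10.8, 9.9, 10.4, 10.1]

    def mean(xs):
        return sum(xs) / len(xs)

    observed_diff = mean(group_a) - mean(group_b)

    pooled = group_a + group_b
    n_iter = 10000
    extreme = 0
    for _ in range(n_iter):
        random.shuffle(pooled)
        diff = mean(pooled[:len(group_a)]) - mean(pooled[len(group_a):])
        if abs(diff) >= abs(observed_diff):
            extreme += 1

    # Fraction of shuffles at least as extreme as the observed difference,
    # i.e. a simulated p-value with no distributional formulas involved.
    print(extreme / n_iter)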
If you've gone through this book and want to learn more ways to use statistics and probability, you should look at the videos from mathematicalmonk[1] on YouTube - he's like Khan Academy for machine learning and more rigorous probability. They helped me learn machine learning and hidden Markov models pretty well, at least in my experience.
Very much so. I'm still trying to find out exactly what maths will be required; I'm looking at material from related courses at http://see.stanford.edu/see/courses.aspx.
That's partly why I bought it. My university didn't have a statistics department, so I never had the chance to take a statistics course that was worthwhile.
Statistics and linear algebra really should be required by all CS programs. It's funny that at many schools those courses are not, yet Calculus is. First, Calculus should have been handled in HS. Second, I've never had a use for Calculus professionally or for anything I've worked on in my free time.
Calculus is a major prerequisite for both a reasonably serious statistics course and a reasonably serious AI/ML course. Furthermore, at least for ML it is multivariable calculus built on top of linear algebra; I doubt you could learn that in high school at the level needed.
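As a rough sketch of what that combination looks like in practice (my own example, with made-up data and an assumed step size), here is linear regression fitted by gradient descent; the gradient of the squared-error loss is a multivariable-calculus result written entirely in linear-algebra notation.

    # Fitting a linear model by gradient descent. The gradient of the mean
    # squared error, (2/n) * X^T (X w - y), is where multivariable calculus
    # and linear algebra meet. Data, step size, and iteration count are assumed.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))            # 100 samples, 3 features
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=100)

    w = np.zeros(3)
    step = 0.1
    for _ in range(500):
        grad = (2 / len(y)) * X.T @ (X @ w - y)   # gradient of the loss
        w -= step * grad

    print(w)   # should land close to true_w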
I've read O'Reilly's Collective Intelligence. It's a great introductory survey, but it was very light on theory.
I also own Collective Intelligence in Action. It had more explanation of theory than O'Reilly's offering, but most of the chapters devolved into how to use Java data mining framework X.
Furthermore, another reason the entire classic test-based approach is bad is that it encourages binary hypotheses when many real-world problems don't fit that mold: the hypotheses may be numerous, composite, or even infinite in number. (Of course, many real-world problems can be reduced to binary ones, which is one reason the approach became popular.)
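For instance (a toy example of my own, not from the thread), with more than two candidate hypotheses you can simply compute a posterior over all of them instead of forcing a binary accept/reject decision. The candidate coin biases and the flip data below are made up.

    # Posterior probabilities over several candidate coin biases, rather than
    # a single binary "fair vs. not fair" test. Hypotheses and data are assumed.
    from math import comb

    candidate_p = [0.3, 0.4, 0.5, 0.6, 0.7]       # candidate hypotheses
    prior = [1.0 / len(candidate_p)] * len(candidate_p)

    heads, flips = 14, 20                          # made-up data

    likelihood = [comb(flips, heads) * p**heads * (1 - p)**(flips - heads)
                  for p in candidate_p]
    unnorm = [pr * lik for pr, lik in zip(prior, likelihood)]
    posterior = [u / sum(unnorm) for u in unnorm]

    for p, post in zip(candidate_p, posterior):
        print(f"p = {p:.1f}: posterior = {post:.3f}")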
If you know enough probability theory, statistics is just a special case. The nice thing about using probability theory is that if you do decide to use a 'test', all of your assumptions are put forth first. As E.T. Jaynes says:
In estimating a location parameter, for example, the sample median M is often cited as a more robust estimator than the sample mean. But here it is obvious that this ‘robustness’ is bought at the price of insensitivity to much of the relevant information in the data. Many different data sets all have the same median; the values above or below the sample median may be moved about arbitrarily without affecting the estimate. Yet those data values surely contain information highly relevant to the question being asked, and all this is lost. We would have thought that the whole purpose of data analysis is to extract all the information we can from the data.
Thus, while we agree that robust/resistant properties may be desirable in some cases, we think it important to emphasize their cost in performance. In the literature, ad hoc procedures have been advocated on no more grounds than that they are ‘robust’ or ‘resistant’, with no mention of the quality of the inference they deliver, much less any comparison of performance with alternative methods; yet alternative methods such as Bayesian ones are criticized on grounds of lack of robustness, without any supporting factual evidence.
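A quick numerical illustration of Jaynes's point (toy numbers of my own choosing): the values above the sample median can be moved about arbitrarily without changing the median at all, while the mean responds to that information.

    # The median ignores how far the upper values sit; the mean does not.
    import statistics

    data  = [1.0, 2.0, 3.0, 4.0, 5.0]
    moved = [1.0, 2.0, 3.0, 40.0, 500.0]   # values above the median moved arbitrarily

    print(statistics.median(data), statistics.median(moved))   # 3.0 and 3.0
    print(statistics.mean(data), statistics.mean(moved))       # 3.0 vs 109.2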
Sorry! It's really not meant to be a trick. The book is under a CC license. I provide the PDF version at Green Tea Press, the LaTeX source for anyone who wants to make a modified version, and a not very good HTML version (some of the math is broken). O'Reilly provides the printed and Kindle versions.
Right now all versions have the same content, but I will continue to revise, so you can think of the version on Green Tea Press as the draft of the second edition.
I know it can be confusing, but I hope the benefits of the free license make up for it.
Green Tea Press are good guys; I would never feel sorry about paying them. They have done far more for humanity than my $30 ever will: http://greenteapress.com/
I'm a little confused about the various editions. Just a few weeks ago I bought the "Think Stats" ebook published by O'Reilly (same author). Is it all the same content?