How Not to Lie with Statistics: Avoiding Common Mistakes (1986) [pdf]

capnrefsmmat · on Sept 22, 2014

I agree with most of the points here, although I've never seen anyone seriously attempt regression on residuals (instead of multiple regression) outside of an introductory regression course. (Actually, I graded a homework problem on it just last week.)
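As a rough illustration of why regression on residuals differs from multiple regression, here is a minimal simulation sketch. Everything in it is an assumption made for the example: the two correlated predictors `x1` and `x2`, the true coefficients (1.0 and 2.0), the noise level, and the helper names. When the predictors are correlated with correlation r, the slope obtained by regressing the residuals of y on x1 against x2 shrinks by roughly a factor of 1 - r^2 relative to the multiple-regression coefficient.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def simple_slope(x, y):
    """OLS slope of y on a single predictor x."""
    return cov(x, y) / cov(x, x)

def two_predictor_slopes(x1, x2, y):
    """OLS slopes of y on x1 and x2 jointly (2x2 normal equations)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Hypothetical data: x2 is correlated with x1 (correlation about 0.8).
rng = random.Random(0)
n = 20000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x1]
y = [1.0 * a + 2.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

# Regression on residuals: fit y ~ x1 first, then regress the leftover on x2.
slope_y_on_x1 = simple_slope(x1, y)
mean_x1, mean_y = mean(x1), mean(y)
residuals = [yi - mean_y - slope_y_on_x1 * (xi - mean_x1) for xi, yi in zip(x1, y)]
b2_from_residuals = simple_slope(x2, residuals)

# Multiple regression: fit y ~ x1 + x2 in one step.
b1_multiple, b2_multiple = two_predictor_slopes(x1, x2, y)

print(f"true b2 = 2.0, multiple regression = {b2_multiple:.2f}, "
      f"regression on residuals = {b2_from_residuals:.2f}")
```

With these assumed settings the multiple-regression estimate should land near the true 2.0, while the residual-based estimate should land near 2.0 * (1 - 0.8^2) = 0.72.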

With all the data collected by tech companies these days, I'd be worried about other problems. It's easy to dig through loads of variables looking for correlations, and you'll inevitably find false positives. If you dig deep enough in your data, looking for differences in conversion rates between Southeast Asian Chrome users and Nordic users of Opera Mini, then you'll also have poor statistical power and end up with wildly exaggerated results.
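The exaggeration effect can be sketched with a small simulation. The conversion rates, sample size, and number of simulated experiments below are hypothetical, and the test is a plain two-proportion z-test under a normal approximation. The idea is that with a small true difference and small segments, only the experiments that happen to observe a large difference reach p < 0.05, so the significant results overstate the true effect.

```python
import math
import random

def two_sided_p(conv_a, conv_b, n):
    """Two-proportion z-test p-value (normal approximation, equal arm sizes)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # no conversions (or all conversions) in both arms
    z = abs(conv_a - conv_b) / n / se
    return math.erfc(z / math.sqrt(2))

# Hypothetical setup: a small true difference and a small segment size.
RATE_A, RATE_B = 0.10, 0.12
N_PER_ARM = 100
N_EXPERIMENTS = 3000
TRUE_DIFF = RATE_B - RATE_A

rng = random.Random(1)
significant_diffs = []
for _ in range(N_EXPERIMENTS):
    conv_a = sum(rng.random() < RATE_A for _ in range(N_PER_ARM))
    conv_b = sum(rng.random() < RATE_B for _ in range(N_PER_ARM))
    if two_sided_p(conv_a, conv_b, N_PER_ARM) < 0.05:
        # Record the size of the observed difference, whatever its sign.
        significant_diffs.append(abs(conv_b - conv_a) / N_PER_ARM)

power = len(significant_diffs) / N_EXPERIMENTS
mean_significant_diff = sum(significant_diffs) / len(significant_diffs)
exaggeration = mean_significant_diff / TRUE_DIFF

print(f"share of experiments reaching p < 0.05: {power:.1%}")
print(f"mean |difference| among them: {mean_significant_diff:.3f} "
      f"(true difference {TRUE_DIFF:.3f}, about {exaggeration:.1f}x)")
```

Because the absolute difference is recorded, some of the "significant" results in this sketch even point in the wrong direction.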

(I am slightly biased here because I have written an entire book on the subject: http://www.statisticsdonewrong.com/)

tjradcliffe · on Sept 22, 2014

The paper is old, and regression on residuals was much more common back in the day. There were notable dust-ups within some small subsets of the physics community due to its use in a few cases.

Your book looks like a good read. It's always nice to see someone talk about stopping rules in particular, which are among the nastiest problems in experiment design for both practical and theoretical reasons--the practical problem is really one of economics, and we just don't train researchers properly for that.
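One way to see why stopping rules matter is a sketch of "peeking": testing after every batch of data and stopping as soon as the result looks significant. The batch size, number of looks, and the simple z-test with known standard deviation below are assumptions chosen for the example, and there is no real effect in the simulated data.

```python
import math
import random

def rejects_null(rng, batch_size, n_batches, peek):
    """Run one experiment with no true effect (mean 0, known sd 1).

    With peek=True, test after every batch and stop at the first |z| > 1.96.
    With peek=False, test only once, after all batches are collected.
    """
    total, n = 0.0, 0
    for batch in range(n_batches):
        for _ in range(batch_size):
            total += rng.gauss(0, 1)
            n += 1
        is_last = batch == n_batches - 1
        if peek or is_last:
            z = total / math.sqrt(n)  # sample mean times sqrt(n)
            if abs(z) > 1.96:
                return True
    return False

rng = random.Random(2)
N_SIMS = 2000
fixed_rate = sum(rejects_null(rng, 10, 20, peek=False) for _ in range(N_SIMS)) / N_SIMS
peeking_rate = sum(rejects_null(rng, 10, 20, peek=True) for _ in range(N_SIMS)) / N_SIMS

print(f"false positive rate, fixed sample size: {fixed_rate:.1%}")
print(f"false positive rate, stop at first p < 0.05: {peeking_rate:.1%}")
```

The fixed-sample rate should sit near the nominal 5%, while the peeking rate should come out several times higher.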

brockf · on Sept 22, 2014

This paper was published just this year in a reputable journal, and it highlights ongoing misunderstandings about what regressing on residualized variables actually does: http://www.clas.wayne.edu/Multimedia/wurm/files/Wurm%20%26%2...

The problem continues, for sure!

capnrefsmmat · on Sept 22, 2014

Oh, that's interesting, thanks. It's amazing what weird strategies people come up with when analyzing their data.

jzwinck · on Sept 22, 2014

Thanks for posting the link to your book--it's interesting reading.

There seems to be a fairly common error within:

> One 1992 telephone survey estimated that American civilians use guns in self-defense up to 2.5 million times every year – that is, about 1% of American adults have defended themselves with firearms.

We cannot simply divide the event count by the population count, because a single person may have used a gun more than once in a year. In fact, someone who has used a gun during the year is more likely to use one later in the year than someone who has not yet used one, because some people live in dangerous areas, are themselves belligerent, or both.
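A quick arithmetic sketch of the point: the share of adults involved depends on how many incidents each person accounts for. The adult population below is simply the figure implied by the quote's own "about 1%" (2.5 million / 0.01), and the incidents-per-person values are hypothetical, since the survey figure quoted here does not give them.

```python
def share_of_adults(incidents_per_year, adult_population, incidents_per_person):
    """Fraction of adults involved, given the mean incidents per involved person."""
    people_involved = incidents_per_year / incidents_per_person
    return people_involved / adult_population

INCIDENTS = 2_500_000
ADULTS = 250_000_000  # assumption: the population implied by the quoted "about 1%"

for per_person in (1.0, 2.0, 5.0):  # hypothetical repeat-use rates
    share = share_of_adults(INCIDENTS, ADULTS, per_person)
    print(f"{per_person:.0f} incident(s) per person -> {share:.2%} of adults")
```

Only the one-incident-per-person case reproduces the 1% figure; at two incidents per person the share halves to 0.50%, and at five it drops to 0.20%.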

bainsfather · on Sept 22, 2014

Seconded. A great book (I've read 75% of it so far).

About the error you mention - I think you are right - but the author brings up the ~1% because he is talking about 'base rate fallacy' - he wants to say that the errors from the 99% of the population will swamp the true signal from the 1%. So his ~1% number is likely qualitatively ok for what he is using it for. It should still be reworded though - one wants 0 errors in a book about statistics mistakes :)
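The base rate argument can be worked through with a few lines of arithmetic. The 1% false "yes" rate below is a hypothetical number chosen for illustration, not a figure from the survey or the book, and the sketch assumes everyone with a genuine incident reports it.

```python
def reported_rate(true_rate, false_yes_rate, true_yes_rate=1.0):
    """Share of all respondents answering "yes", whether correctly or not."""
    return true_rate * true_yes_rate + (1 - true_rate) * false_yes_rate

def true_yes_fraction(true_rate, false_yes_rate, true_yes_rate=1.0):
    """Share of "yes" answers that come from people with a genuine incident."""
    true_yes = true_rate * true_yes_rate
    return true_yes / reported_rate(true_rate, false_yes_rate, true_yes_rate)

TRUE_RATE = 0.01       # the roughly 1% discussed above
FALSE_YES_RATE = 0.01  # hypothetical: 1% of everyone else wrongly answers "yes"

print(f"reported rate: {reported_rate(TRUE_RATE, FALSE_YES_RATE):.2%}")
print(f"share of 'yes' answers that are genuine: "
      f"{true_yes_fraction(TRUE_RATE, FALSE_YES_RATE):.1%}")
```

Under these assumptions the survey would report about 1.99%, nearly double the true rate, and only about half of the "yes" answers would be genuine.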

capnrefsmmat · on Sept 22, 2014

This came up during editing for the book. I tried to settle it by checking the original survey, which I thought would surely report the number of incidents reported by each respondent, but they make no mention of it. So I can't tell how many people are involved per year.

ExpiredLink · on Sept 22, 2014

> It's easy to dig through loads of variables looking for correlations, and you'll inevitably find false positives.

Cross-validation?

http://en.wikipedia.org/wiki/Cross-validation_%28statistics%...

new_test · on Sept 22, 2014

No, http://en.wikipedia.org/wiki/Multiple_comparisons_problem
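A minimal sketch of the multiple comparisons problem, using simulated data in which no comparison has a real effect. The number of tests, the sample size, and the simple z-test are assumptions for the example. The Bonferroni correction shown is one standard, conservative adjustment, not the only one.

```python
import math
import random

def null_p_value(rng, n):
    """Two-sided z-test p-value for a sample with true mean 0 and known sd 1."""
    sample_mean = sum(rng.gauss(0, 1) for _ in range(n)) / n
    z = abs(sample_mean) * math.sqrt(n)
    return math.erfc(z / math.sqrt(2))

rng = random.Random(3)
N_TESTS = 1000  # hypothetical number of variables or segments examined
ALPHA = 0.05

p_values = [null_p_value(rng, 50) for _ in range(N_TESTS)]

# Uncorrected: each test is compared against ALPHA on its own.
naive_hits = sum(p < ALPHA for p in p_values)

# Bonferroni: compare each p-value against ALPHA divided by the number of tests.
bonferroni_hits = sum(p < ALPHA / N_TESTS for p in p_values)

print(f"'significant' without correction: {naive_hits} of {N_TESTS}")
print(f"'significant' with Bonferroni: {bonferroni_hits} of {N_TESTS}")
```

Without correction, roughly 5% of the null comparisons should come out "significant" by chance alone; the corrected threshold should remove nearly all of them, at the cost of lower power to detect real effects.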

danso · on Sept 22, 2014

OT: I've read and accessed this before, but I hadn't noticed the note in which Harvard solicits feedback on its Open Access initiative, under which it releases papers such as this one. Their online feedback form is here:

https://osc.hul.harvard.edu/dash/open-access-feedback?handle...

Hopefully they're getting a positive response to open access and see that both the academic community and the public benefit from it.

amathstudent · on Sept 22, 2014

Of course, the problem with things like this, Huff's book, everything by David Freedman, etc., is that people want to lie with statistics. To put it more prosaically, people have biases, prejudices, socially-created expectations, ulterior motives, and usually statistics is a more-or-less subtle technique for whitewashing those into 'scientific knowledge'. This happens all across the social & biological sciences, in medical research, and in industry.

mattdeboard · on Sept 22, 2014

Zed Shaw wrote on this topic a while ago as well:

http://zedshaw.com/essays/programmer_stats.html

thisjepisje · on Sept 22, 2014

From an article on HN a week ago:

> In 1900, about 4 percent of the U.S. population was older than 65. Today, 90 percent of all babies born in the developed world will live past that age.