Hacker News

That's... not how statistical significance works.



Sure it is. 60,000 weddings * 0.02% gives an expected 12 positive examples, which really isn't much. Assuming a binomial process, n=60000 and p=0.0002 gives a 95% confidence interval of 5.2 to 18.6, which is a really wide range when you want to show trends. I don't know if the percentages are by year, but if they are, the issue is even worse.
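For anyone who wants to check the arithmetic, here is a rough sketch using the normal approximation to the binomial. The function name is made up for illustration, and the approximation is an assumption on my part; the comment does not say which method produced its range. It comes out to roughly 5.2 to 18.8, so the upper end lands a little above the quoted 18.6.

```python
import math

def binomial_normal_interval(n, p, z=1.96):
    """Approximate 95% range for a binomial count (normal approximation)."""
    mean = n * p                      # expected number of positives
    sd = math.sqrt(n * p * (1 - p))   # binomial standard deviation
    return mean, mean - z * sd, mean + z * sd

# n and p are taken from the comment above: 60,000 weddings at 0.02%.
mean, low, high = binomial_normal_interval(60_000, 0.0002)
print(f"expected {mean:.1f}, interval {low:.1f} to {high:.1f}")
```

With an expected count this small, the normal approximation is crude; an exact binomial or Poisson interval would be shifted somewhat, but it would be about as wide.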

The post just does a good job of hiding it by smoothing the plots. Compare an unsmoothed plot: http://www.weddingcrunchers.com/?q=Democrat%20%2B%20Democrat... with the smoothed plot in the article: http://s3.amazonaws.com/rapgenius/HhvuocYI3raAnYpWPE4HaeCh9a...

While the % of Republicans does appear to fall, the % of Democrats in the last year is lower than in the first year, the opposite of the conclusion they want you to draw!


If this is like most n-gram analyses, the percentages are of the total corpus, i.e. percentage of words, not articles. So 60,000 articles could be 12,000,000 words and 2,400 positives if there are 200 words per article (a SWAG).
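Spelled out as a quick sketch, with the 200 words per article being the guess from this comment rather than a measured value:

```python
articles = 60_000
words_per_article = 200   # a guess (SWAG), not a measured figure
rate = 0.0002             # 0.02% expressed as a fraction

total_words = articles * words_per_article   # size of the corpus in words
expected_hits = total_words * rate           # expected matches at 0.02%
print(total_words, round(expected_hits))
```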


Looks like you are right. From the FAQ:

> What does the y-axis mean exactly? The y-axis represents the frequency of each phrase, as a percentage of all phrases that contain the same number of words. For example, if you search for "from New York", the graph shows the number of times those words appear in exact order, divided by the total number of 3 word phrases in all of the articles

I think doing it at a per-article level makes more sense for an analysis like this, but 0.02% is actually pretty significant when n is on the order of millions.
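To put a rough number on that, here is a sketch comparing the relative width of the 95% range at the two sample sizes discussed in this thread. It assumes the normal approximation to the binomial, and the 12,000,000 figure is the 200-words-per-article guess from earlier, not the real phrase count. The helper name is hypothetical.

```python
import math

def relative_half_width(n, p, z=1.96):
    """Half-width of the approximate 95% range, as a fraction of the expected count."""
    return z * math.sqrt(n * p * (1 - p)) / (n * p)

# 60,000 = per-article count; 12,000,000 = guessed per-word count.
for n in (60_000, 12_000_000):
    print(f"n={n:>10,}: +/- {relative_half_width(n, 0.0002):.1%}")
```

Under those assumptions the uncertainty drops from somewhere above 50% of the expected count to about 4%, which is why the same 0.02% reads very differently at the per-word scale.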

Thanks for the clarification.


So the implication is they took an SEO-friendly subject likely to have plenty of interesting factoids, went fishing for interesting insights, and wrote a blog post about it. Page three of the Startup-guide-to-SEO-effectiveness.


It could have just barely achieved statistical significance, but it would be hard to draw conclusions from it. Presumably the other 99.98% of families had political preferences, too. We just don't know what they are. And we don't know what caused that tiny percentage to share theirs.

It wouldn't be a bad idea to factor in the number of Democrats vs Republicans holding offices in the area around NYC during that time, either. I know NY state leans Democratic, and Democrats do well in city-level elections. Holding an actual office would probably make you more likely to mention your party.



