N-gram Analysis of the New York Times Weddings Section (rapgenius.com)
250 points by lil_tee on Sept 4, 2013 | 64 comments



Another statistical analysis of the NYT wedding section, looking at the occurrence frequency of certain characteristics in the NYT wedding announcements relative to their occurrence in the general population: http://www.theatlanticwire.com/entertainment/2011/12/odds-ge....

You can clearly see the recent tech boom by searching "Google," "Facebook," "Twitter," and "Apple" http://www.weddingcrunchers.com/?q=facebook%2C%20google%2C%2....

The key takeaway here is Google:

Google has raced ahead of establishment NY law firms: http://www.weddingcrunchers.com/?q=wachtell%2C%20cravath%2C%....

Google has also recently overtaken top investment banks: http://www.weddingcrunchers.com/?q=goldman%20sachs%2C%20morg...

Ditto for consulting: http://www.weddingcrunchers.com/?q=mckinsey%2C%20boston%20co....

When do you think Google will start hosting a debutante ball in Chelsea?


You mean Google's yearly Deb Ops ball?


Given how horizontally expansive RapGenius is trying to be (this is the first time I've seen NewsGenius, but I'm familiar with PoetryGenius et al -- there's an annotated Iliad that's pretty cool), I'm wondering if they'd be better off as a layer or a plugin as opposed to a stand-alone site. I'm much less tempted to visit the site for each individual story that pops up than I would be to peruse the annotations as I browse normally.

Either way -- stuff like this is a delight to read.


My understanding was that RapGenius was always a text annotation platform, with the hip-hop stuff serving as an exemplar, rather than being the actual product.


Always a text annotation platform? That's a bit revisionist from what I've read.

Fred Wilson initially told them "I think lyrics is a very crowded space and almost entirely reliant on Google for traffic" and they admitted "our pitch back then was a bit too lyrics-focused.."

http://news.rapgenius.com/Lemon-how-rap-genius-raised-s18m-i...

You can find those quotes in an annotation in the above link (which makes me realize the problem with annotations is you can't ctrl-f them).


Here is a permalink for that annotation: http://news.rapgenius.com/1900809

You can get that by clicking "share" in the annotation footer


Fair enough! I don't have a citation for my understanding, I just read waaaaay too much HN, and that's what was stuck in my brain. Thanks for clarifying.


A layer is not a good idea from a business point of view in my opinion, given how most websites (Facebook, Twitter with their API changes, Spotify with their new messaging system and so on) are trying to lock you in. Maybe simply build a way to take you back directly to RapGenius, like an annotate-this-on-rg button.


Most users have no idea what a plugin is or how to use it.

That said, it might be nice to have an optional plugin to easily discover annotations and add your own to pages as you browse.


Funny you should mention that.

Given that that's the defense people seem to proclaim every time someone mentions that disabling JavaScript is now buried as an arcane config flag in Firefox.

They say: "just use a plug-in", "just use an add-on"...

What they mean to say is: "just don't disable javascript at all... ever."


There is a direct connection between people who want to disable JavaScript (and even know what it is) and people who know what plugins are.

They are the same people.


Most people, especially those who don't know what a plugin is, should emphatically NOT be disabling javascript wholesale and without exceptions in Firefox. And I disagree that it's incredibly hard to google, find NoScript, and follow the directions to install it.


"This makes it possible to rigorously test our intuitions about trends like."

let me fix that for you

"This makes it possible to put numbers on our preconceived notions and play around with them."

It may be entertaining, but rigorous? I don't think so.


Yes, generally n-gram-based analyses are a huge minefield. Computational linguists do use them, with a lot of caveats and careful analysis of confounding factors.

One simple one that comes to mind here is that you need to analyze to what extent changes over the period of the data set are caused by underlying societal changes, versus changes in the NYT itself; the end result will be a mixture of those two changes, some of which may be magnifying and others offsetting. The 1980 NYT and the 2013 NYT are not the same newspaper, not edited by the same people, not sold to the same readership demographics, and not soliciting the same advertisers, so it's somewhat questionable to treat it as a stable proxy for a social group.

Another common pitfall is language change screwing up all kinds of measures (since n-gram models just work on word counts). For example, if two words are used roughly interchangeably in 1980, but by 1990 one of them has fallen out of usage, and been replaced wholly by the other one, searches for just the one word will look like the word's on an upwards trend, but it would be misleading to infer an increase in the underlying concept over the period. Of course, you can account for this by merging words into equivalence classes (most analyses will do basic stemming and merging of alternate spellings), but you have to be very careful to get all the equivalence classes (which is not a well-defined notion). Just a list of the top words in a year will tend to be some mixture of 1) top concepts; and 2) concepts expressed using only a small number of wording variations, so their count doesn't get diluted.
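To make the equivalence-class point concrete, here is a minimal sketch. The word pair, the mapping table, and the toy corpora are all made up for illustration; a real analysis would need a far larger and more carefully built mapping.

```python
from collections import Counter

# Hypothetical equivalence classes: each variant maps to one canonical label.
EQUIVALENCE = {
    "physician": "doctor",
    "doctor": "doctor",
}

def class_counts(tokens, equivalence=EQUIVALENCE):
    """Count tokens after mapping each variant to its canonical class."""
    return Counter(equivalence.get(t, t) for t in tokens)

# Toy corpora (invented): the concept is mentioned 10 times in both years,
# only the preferred wording shifts.
corpus_1980 = ["physician"] * 5 + ["doctor"] * 5
corpus_1990 = ["doctor"] * 10

# Searching for the single word suggests the count doubled...
raw_1980 = Counter(corpus_1980)["doctor"]
raw_1990 = Counter(corpus_1990)["doctor"]

# ...while the merged class shows no change at all.
merged_1980 = class_counts(corpus_1980)["doctor"]
merged_1990 = class_counts(corpus_1990)["doctor"]

print(raw_1980, raw_1990, merged_1980, merged_1990)
```

The raw counts rise from 5 to 10 while the merged counts stay at 10, which is exactly the misleading "upwards trend" described above.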


yeah the guesses on ethnicity based on common last names? "chang chen wong" ... rigorous indeed.


Not to take an entertaining post too seriously, but when your graph scale ranges from 0 - 0.02% the statistical significance is dubious.


That's... not how statistical significance works.


Sure it is. 60,000 weddings * 0.02% is an expected number of 12 positive examples, which really isn't much. Assuming a binomial process, n=60000 and p=0.0002 gives a 95% confidence interval of 5.2 to 18.6, which is a really wide range when you want to show trends. I don't know if the percentages are by year, but if they are the issue is even worse.

The post just does a good job of hiding it by smoothing the plots. Compare an unsmoothed plot: http://www.weddingcrunchers.com/?q=Democrat%20%2B%20Democrat... with the smoothed plot in the article: http://s3.amazonaws.com/rapgenius/HhvuocYI3raAnYpWPE4HaeCh9a...

While the % of republicans does appear to fall, the % of democrats in the last year is lower than in the first year, the opposite of the conclusion they want you to draw!
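For anyone who wants to check the interval, a rough sketch using the normal approximation to the binomial follows. It assumes, as this comment does, that the percentage is per wedding (n = 60,000, p = 0.0002); the function name is just illustrative.

```python
import math

def normal_approx_count_interval(n, p, z=1.96):
    """Approximate 95% interval for a binomial count via the normal approximation."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean - z * sd, mean + z * sd

# Assumed inputs taken from the comment: 60,000 weddings, 0.02% rate.
low, high = normal_approx_count_interval(60_000, 0.0002)
print(f"expected count: {60_000 * 0.0002:.1f}, interval: {low:.1f} to {high:.1f}")
```

This comes out at roughly 5.2 to 18.8, in line with the range quoted above. With counts this small the normal approximation is crude, and an exact binomial interval would differ somewhat, but the width of the range is the point.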


If this is like most n-gram analyses, the percentages are of the total corpus, i.e. percentage of words, not articles. So 60,000 articles could be 12,000,000 words and 2400 positives if there are 200 words per article (a SWAG).


Looks like you are right. From the FAQ:

>What does the y-axis mean exactly? The y-axis represents the frequency of each phrase, as a percentage of all phrases that contain the same number of words. For example, if you search for from New York, the graph shows the number of times those words appear in exact order, divided by the total number of 3 word phrases in all of the articles

I think doing it at a per-article level makes more sense for an analysis like this, but 0.02% is actually pretty significant when n is on the order of millions.

Thanks for the clarification.
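Going by the FAQ definition quoted above, the y-axis could be computed along these lines. This is only a sketch: whitespace tokenization, lowercasing, and the two toy "articles" are assumptions, not how the site necessarily does it.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def phrase_frequency(articles, phrase):
    """Occurrences of the phrase divided by the number of same-length phrases."""
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = 0
    total = 0
    for text in articles:
        grams = ngrams(text.lower().split(), n)
        total += len(grams)
        hits += sum(1 for g in grams if g == target)
    return hits / total if total else 0.0

# Toy corpus (invented): each article has 6 words, so 4 three-word phrases.
articles = [
    "the bride is from new york",
    "the groom is from new jersey",
]
freq = phrase_frequency(articles, "from New York")
print(freq)  # 1 hit out of 8 three-word phrases
```

The denominator here is the total number of n-grams, which is on the order of the total word count, not the number of articles.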


So the implication is they took an SEO-friendly subject likely to have plenty of interesting factoids, then went fishing for interesting insights and wrote a blog post about it. Page three of the Startup-guide-to-SEO-effectiveness.


It could have just barely achieved statistical significance, but it would be hard to draw conclusions from it. Presumably the other 99.98% of families had political preferences, too. We just don't know what they are. And we don't know what caused that tiny percentage to share theirs.

It wouldn't be a bad idea to factor in the number of Democrats vs Republicans holding offices in the area around NYC during that time, either. I know NY state leans Democratic, and Democrats do well in city-level elections. Holding an actual office would probably make you more likely to mention your party.


As you said, caveat 'entertaining' blah, blah. That said...

That's actually a slightly dubious analysis. The question you need to ask is 0.02% of what? In this case, I would take a guess it means 0.02% of all the words analyzed. As a very simple example, imagine analyzing all the letters in a book. If English were perfectly balanced, we would expect to see each of the 26 letters at 1/26, or about 3.8%, so seeing the letter 'e' appear at, say, 1.0% (or even 0.1%) would be a notable statistical result.
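The letter example can be sketched in a few lines. The uniform baseline of 1/26 (about 3.8%) is the "perfectly balanced" assumption from the comment; the sample sentence is arbitrary and far too short to say anything about English.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Share of each a-z letter among all a-z letters in the text."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {c: counts[c] / total for c in string.ascii_lowercase}

UNIFORM = 1 / 26  # baseline if every letter were equally likely

# Arbitrary sample text, used only to exercise the function.
sample = "the quick brown fox jumps over the lazy dog"
freqs = letter_frequencies(sample)
ratio_e = freqs["e"] / UNIFORM  # how far 'e' sits from the uniform baseline

print(f"uniform: {UNIFORM:.4f}, e: {freqs['e']:.4f}, ratio: {ratio_e:.2f}")
```

What matters is the observed share relative to the baseline and the size of the sample, not whether the percentage looks small in absolute terms.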


I'd say it's 0.02% of all the weddings, at least the axis is "NYT Wedding Frequency" not "Word Frequency".


If all statistics were required to be rigorous, there'd be a whole lot less to read about in the newspaper, that's for sure.


Are all wedding announcements posted, or does The Times have an editorial role in which announcements are printed?


The Times certainly picks and chooses which are printed. That's the whole reason for the interest in the topic. It's a perfect collision between young-adult ambition and old-school establishment vetting.

I understand that, like college admissions, you can hire a wedding planner or consultant who can considerably raise the chances of your wedding being listed.

The NYT obits are another interesting read.


What is the point of this? Is it some kind of odd "old-school" status thing? The upper class equivalent of being on the 8 o'clock news for a 1 minute interview? Although perhaps this is only for people already investing a huge amount of money into the wedding ceremony?


You got it with the first suggestion. It's a fun distraction for the well-bred, nothing more.

Getting a write-up in the Times is one affirmation that you're a power couple in a certain northeastern old-school way, or a human interest angle.


We used a high-end wedding planner, and when we asked her what to do for the NY Times announcement, she told us to submit via the form on the website. Maybe she pulled some strings behind the scenes, but it seemed to us that she had no pull whatsoever on that front.


I've read these for years and have known people who appeared.

The factors that enter into getting in (from my observation strictly) are a combination of things like:

- parents who live in ny metro

- the parties getting married living in ny metro

- having gone to school in ny metro

- parents or parties getting married working in ny metro

- what the parents do for a living

- any lineage "grandparent governor of NY"

- what the parties getting married do for a living

- school attended as far as perceived impressiveness

- whether an impressive job or title of any of the parties mentioned.

..and so on. That's off the top.

For example, "physician" and "went to school in NY" is probably almost assured to get the announcement printed.

"father a mechanic, mother a homemaker, inlaws are nobodies, parties are cashiers who work at walmart, no college, live in jersey city"[1] and so on either don't get in, don't care to get in, or don't have the drive to even submit a form to get in.

[1] Unless of course one of the parties is related to a famous former politician or some other mitigating factor.


I found my high school classmate's wedding announcement in the NY Times. She happened to be a doctor.


I know someone who hired a planner who helped them get in. I'm not sure how. To be sure, they were somewhat qualified already, but noticeably below the bar.

I've read the wedding announcements on and off since about 1991, but much less these days, because I only get the online edition now.


Considering that there are 8 million people living in New York, including everyone would present some serious challenges indeed.


I don't have any stats, but I know that very few of those who submit are selected. My wife and I received a long form write-up (just under 700 words). The process started with our submission on the website, but then we got a call from a writer a few weeks later. Between talking to us, our parents, and our minister, I'd say he put at least 10 hours into researching and writing.

A day or two before the wedding, he told us that he wrote both a long form piece and a shorter, more typical piece. He wasn't sure which would get published, but he was obviously pushing for the longer piece to get in. It did.

Oddly enough, our write-up isn't included in the Rap Genius dataset. Maybe it's too recent or the longer write-ups aren't included.


The data-set includes only the announcements, not the stories published under the "Vows" column.


Not to be rude or offend you, but, why? You're saying a newspaper spent 10 hours researching and writing a 700-word article about your wedding? Is this a "local flavour" type article, like they might do about assorted residents? Or are your families famous or something? Did you pay for this kind of placement? Perhaps I'm socially inept here, I just feel like I'm missing something.

As far as being in the dataset, certainly they aren't analyzing every flavour-style article, just "announcements"? Just like a 2-page life-in-review article on someone famous when they die doesn't really go in the obit section, does it?


I was surprised that we got in and shocked that they put so much time into the article. I can't believe it makes economic sense to put so much into such a short write-up, but I guess it does for The Times.

Together my wife and I check quite a few boxes that the NY Times typically looks for. We aren't famous or all that noteworthy, but we do have an interesting story of how we met. That was the main focus of the article.



They have editorial discretion. I'd be curious to know what the "acceptance rate", so to speak, is... I know for a fact that a couple I know was rejected specifically because one of them was undergoing gender reassignment and the NYT didn't know which personal (gender) pronoun to use for him.


The NYT said they rejected it for that reason, or that is what your friend assumed to be the case?


AP StyleBook says to use the gender the individual identifies as. That is why all the news media now report on Chelsea Manning under that name.


Maybe a different section would have printed it. I'm kind of interested.


I am very surprised by the prevalence of "was graduated from". I have only rarely heard that in 'real life', is the NYT's style guide enforcing this usage?


Those are the most popular phrases in 1980. The Times no longer uses "was".


I expect so - graduation is something that happens to you


Agreed. When people say, "I graduated college", they are saying the college graduated from them, not they graduated from college. The correct usage should be, "I graduated from college."


What they should do is find the divorce stats and throw that in the mix.


Really interesting - how did you guys download the 60K articles (is there an API I do not know about)? Also - what graphing lib are you using (I see it is not d3)?


The graphing library is highcharts (http://www.highcharts.com/)


A 1987-2007 dataset is available from UPenn:

http://ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2...


Does anyone have any clue how they would download these 60K articles ???


I was expecting something more interesting, but that's probably because I like N-gram analysis, among other things.

This is how we do it (examples below are not weddings, but random topics):

http://blogdotitrendcorporationdotcom.files.wordpress.com/20...

http://blog.itrendcorporation.com/2013/04/10/social-media-on...


How were they able to collect all the wedding announcements? Doesn't NYT limit the number of articles or what portions of text they can retrieve?


Reminds me of something a certain Mr Swartz did ...



This is awesome -- it's kind of like the older Priceonomics blogs which used quantitative analysis to uncover hidden facts in plain sight.


Was this designed specifically for Katie Baker (she writes summaries of the NYT Wedding Section for Grantland)?


Is it just me or does anyone else dislike reading articles like this with dark background and light text?


Personally I prefer it. Dark text on a white background (aka, the norm) often requires me to turn down the brightness on my laptop.


Why is RapGenius interested in this ??


I think they're 25-35 year old guys living in NYC...


Whoever uses two similar shades of blue in a chart should be beaten with a stick until they learn more colors.


Learn black and blue?



