Rappers, Sorted by Size of Vocabulary (amazonaws.com)
600 points by sinned on May 4, 2014 | 267 comments



I enjoyed reading this chart, but I hope it doesn't reinforce the bias some fans have that word complexity is the only way to tell whether a rapper is good. There are several ways to judge the strengths and weaknesses of a rapper. Complexity is one of them; flow is another. Storytelling ability is also a very strong indicator. The best rappers are able to bring a mix, while some are so strong in one area that they explode even if they are really weak in others.


To fully understand rap, we must first be fluent with its meter, rhyme, and figures of speech. Then ask two questions: One, how artfully has the objective of the song been rendered, and two, how important is that objective. Question one rates the song's perfection, question two rates its importance. And once these questions have been answered, determining a song's greatness becomes a relatively simple matter.

If the song's score for perfection is plotted along the horizontal of a graph, and its importance is plotted on the vertical, then calculating the total area of the song yields the measure of its greatness.


god. fucking. damn.

rips page out of book


I completely agree.


It's kind of like putting book authors in this kind of a list. Using more words and more complex ones doesn't really say anything about the quality of the writing. Pretty interesting data-set for fun though.


William Faulkner once criticised Ernest Hemingway by saying that he "had never been known to use a word that might send the reader to the dictionary."

Hemingway responded by saying, "Poor Faulkner. Does he really think big emotions come from big words? He thinks I don't know the ten-dollar words. I know them all right. But there are older and simpler and better words, and those are the ones I use."


"Give a dog a bone, leave a dog alone..."

There's too much hate for DMX in these comments. Dude has some tracks that just ooze energy.


I never made the claim that complexity is better/worse. I just wanted to communicate the data point, not imply 2pac>DMX.


This is fascinating. I'm only a recent listener of hip-hop (primarily because of Earl Sweatshirt and Odd Future) and I'm in awe of the vernacular.

And similarly, as a boredom exercise a few weeks ago I did some lexical analysis of the song Timber (the monstrosity was being constantly played on the radio at the time) and here's what I came out with:

"83.1% of the words in the lyrics are five letters or less, 58.9% are four letters or less. The lexical density (the number of unique words divided by the total number of words, multiplied by one-hundred) is 29.1%. There is only one word in the song which has three or more syllables. Eleven people were involved with the writing of the song, each of them capable of producing just nine unique words each."

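For anyone who wants to reproduce numbers like these, here is a minimal sketch of the calculation in Python. The regex tokenizer, the function name `lexical_stats`, and the sample line are all assumptions made up for illustration; this is not the script that produced the figures above.

```python
import re

def lexical_stats(lyrics: str) -> dict:
    """Return lexical density and short-word shares for a block of lyrics."""
    # Lowercase and treat runs of letters/apostrophes as words (a simplification).
    words = re.findall(r"[a-z']+", lyrics.lower())
    total = len(words)
    if total == 0:
        return {"total": 0, "unique": 0, "density_pct": 0.0,
                "le5_pct": 0.0, "le4_pct": 0.0}
    unique = len(set(words))
    return {
        "total": total,
        "unique": unique,
        # Unique words divided by total words, multiplied by one hundred.
        "density_pct": 100.0 * unique / total,
        # Share of words with five (or four) letters or fewer.
        "le5_pct": 100.0 * sum(len(w) <= 5 for w in words) / total,
        "le4_pct": 100.0 * sum(len(w) <= 4 for w in words) / total,
    }

# Made-up sample line, not real lyrics.
sample = "the beat goes on and on and the crowd goes wild"
stats = lexical_stats(sample)
```

Note that the apostrophe counts as a character here, so a word like "it's" has length four; a different tokenizer would shift the percentages slightly.
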

> Eleven people were involved with the writing of the song, each of them capable of producing just nine unique words each.

I'm not sure why this is notable when you consider that lyrics are probably the least important aspect of a song intended for the top 40.

If you take a moment to listen to the melody and production, you'd probably see why it's credited to 10+ people. That song is a well-oiled machine.


My last sentence was intended as satire; lyrics are obviously not uniformly distributed between writers. I completely agree with you though: the song (despite my disdain for it) is incredibly catchy, and definitely not intended to be thoughtful or thought-provoking in nature.


Is that with or without 'the', 'be', 'to'? Not that it's any sort of literary accomplishment regardless, but English is a terrible language for lexical density.
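
One way to answer the with-or-without question is simply to compute the density both ways. A rough sketch, where the tiny `STOP_WORDS` set and the sample line are hypothetical (a real run would use a fuller stop-word list):

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "be", "to", "a", "and", "of", "in", "i", "you", "it"}

def density(words):
    """Unique words / total words * 100; 0.0 for empty input."""
    return 100.0 * len(set(words)) / len(words) if words else 0.0

def density_with_and_without_stops(lyrics: str):
    """Return (density over all words, density with stop words removed)."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    content = [w for w in words if w not in STOP_WORDS]
    return density(words), density(content)

with_stops, without_stops = density_with_and_without_stops(
    "to be or not to be the king of the hill"
)
```

In this toy line the repeated function words are what drag the first number down, which is the effect being asked about.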


this is brilliant. OP here. We should work together.


You must be using big words in case someone does the analysis on HN comments.

Definitions: Vernacular: ?? Lexical analysis (in this case): ratio of unique words to non-unique words. Lexical density: What percent of the words is unique?


The paragraph in quotes is copied and pasted from when I wrote that a few weeks ago, there's a definition in parentheses following lexical density. Analysis is a word that should not need a definition attached to it (you have used it yourself in your comment.)

Vernacular is commonly used in the United Kingdom. Google will provide you with a definition.


Looked for Canibus near the top and wasn't surprised to find him 4th. If anyone hasn't heard of him, highly suggest listening to his older stuff such as his first Can-I-Bus, 2000 BC and Mic Club.

He raps about science and space all the time which is cool.

Here's an example of his ridiculous lyrics: http://rapgenius.com/Canibus-poet-laureate-infinity-lyrics


Additionally, many HN users have probably already heard Canibus rapping even if they don't know it, since he wrote the Office Space theme song. :)


Always loved that song.

My personal favorite Canibus track is Master Thesis, though.


Yey! I'm not the only Canibus fan! Seriously though Mic Club is awesome.


Many here seem to be interpreting vocabulary size as a signal for quality. When it comes to rap I completely disagree. Firstly, the repetition is rap's main ingredient. I read an article a while ago where researchers found that listening to a spoken phrase that is looped activates the same part of the brain as music, which helps explain this phenomenon.

Personally, if I want food for thought I read. Rap is not an intellectual pursuit. I've been perusing rappers on this list, and the top artists have not been good at all to my ears. It seems that the best rappers are in the middle, and being on either extreme is a negative signal.


> It seems that the best rappers are in the middle, and being on either extreme is a negative signal.

Objectively, I don't think that is an accurate assessment. There are highly respected artists all across the chart, including the extremes. For instance, Wu-Tang Clan and its members at the upper extreme and Kanye, Snoop and Tupac at the lower extreme.


Yeah I agree, that statement was definitely a stretch. There are well respected rappers across the spectrum, which is my main point: that the interpretation of the chart as a proxy for quality is wrong.


> Firstly, the repetition is rap's main ingredient.

Is it though? I don't dispute that sheer vocabulary size isn't a sign of quality, but that seems like a very ignorant generalization of rap.


Repetition is a key ingredient in all genres of music. You would be hard pressed to find many significant pieces of music that don't use repetition.


I think saying it's the "main" ingredient is debatable but it surely is very important.


> Firstly, the repetition is rap's main ingredient.

Sure, within one song. But if an artist has a narrow vocabulary throughout their career, it's a sign that they're just writing the same song over and over.


I never made the claim that vocab size was correlated with quality.


I'm addressing what seems to be a common interpretation, not an explicit claim.


It's mostly the voice...


> Shakespeare’s vocabulary: across his entire corpus, he uses 28,829 words, suggesting he knew over 100,000 words

Why does that suggest he knew over 100k words? Maybe it means he knew 28,829 and used all of them? Would he really know over 70,000 words he never used in his works? What would those 70,000 words be? Probably very obscure ones. How can you know that many obscure ones?


Vocabulary has for a while been considered in terms of 'receptive' and 'productive' capacity, with the assumption being that one's 'receptive' vocabulary can be larger, since it is easier to hear/read/understand a word than it is to use it correctly in reading/writing (this is not necessarily the popular opinion anymore [http://www.readingconnect.net/web/FILES/english-language-and...] but may provide the context for the claim about Shakespeare). The notion is that you are able to understand more words than you commonly use in your speech/writing, which is on some level intuitive, although of course it is an empirical question.


Assumptions tend to break down at the extremes.

E.g. Shakespeare actually invented a lot of the obscure words he used.


I imagine that it would be something similar to the German Tank Problem. (http://en.wikipedia.org/wiki/German_tank_problem ) Taking each writing as a sample of the words that are known then would allow for an estimation of total words known. I imagine that this would need to be modified to account for the non-uniform distribution of word use, but the principle would be the same.
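
For reference, the textbook estimator from the German tank problem can be sketched as below. It assumes serial numbers 1..N sampled uniformly without replacement, which is exactly the assumption that word usage violates, so this shows only the classic formula and is not a vocabulary estimator. The function name is made up for illustration.

```python
def german_tank_estimate(serials):
    """Estimate N from distinct serial numbers drawn uniformly from 1..N."""
    k = len(serials)
    if k == 0:
        raise ValueError("need at least one observation")
    if len(set(serials)) != k:
        raise ValueError("serial numbers must be distinct")
    m = max(serials)
    # Classic frequentist estimate: N_hat = m + m/k - 1
    return m + m / k - 1

# Four observed serials with a maximum of 60.
estimate = german_tank_estimate([19, 40, 42, 60])
```

Mapping words onto "serial numbers" and correcting for their heavily skewed frequencies is the open modification mentioned above.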


Again, I'd like to point out that Shakespeare also made up words: http://www.shakespeare-online.com/biography/wordsinvented.ht...


Relevant video:

Kate Tempest - My Shakespeare: https://www.youtube.com/watch?v=i_auc2Z67OM

;-)


The 28K figure is achieved by counting multiple spellings of the same word. Shakespeare lived before dictionaries, so there was never a single standard way to spell a word.


I'm curious if words that Shakespeare invented count. There are many words that we see first used by Shakespeare, though some of them were probably words invented during his time by others with him merely being the first to record (in documents that survived until today).


The latter is probably more the case. The OED, for instance, has had a bias in favor of using Shakespeare as a word's origin since that dictionary's first edition. The number of words attributed to Shakespeare in the OED has dwindled over time.


Just like you or I know tens of thousands of words, but only use some small subset of them in any given work, you wouldn't expect that Shakespeare would use his entire productive vernacular in producing the limited corpus of his literary works.


It's a nice touch including portmanteaus and 'incorrect' ebonics on the list (like "ery'day"), since authors like Shakespeare, Joyce and others took the same liberties with language. Arguably, that's how language develops, and it's what makes it interesting to study and think about. The OP could have easily stuck to words in the OED, kudos.


op here: thanks yo!


Really interesting, but not as representative as it should be. It's not clear why some have larger vocabulary than others. It could be using words like "zeitgeist" (in case of Aesop Rock) or some clever wordplay (I don't know much about hip-hop, so I can't find example for some artist from the list right off the bat, but I remember Marilyn Manson using word "gloominati" for instance) or pretty meaningless made up words like "schizzle" (in case of Snoop Dogg) or usual derivatives like "fuckedy fuck". Moreover, in many transcripts for hip-hop people write down words as they are pronounced, which can be pretty much distorted for some artists (which of course ideally shouldn't count as a "new word", but that's complicated, yeah).

While Aesop Rock and DMX are clearly extreme and not surprising at all, it's not that clear for some guys in the middle.

So, first off, for every data project sources should be provided, or at least more specific definition, how text was processed, tokenized, analyzed. Second, several more "data slices" should be provided, for instance 100 most used words which are unique for that artist compared to other artist in the list.
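
As a sketch of what that second "data slice" could look like: the snippet below lists the most frequent words used by one artist and by nobody else in the set. The regex tokenizer, the function names, and the two placeholder corpora are assumptions for illustration, not the processing used for the chart.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into runs of letters/apostrophes (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def signature_words(corpora, artist, top_n=100):
    """Most frequent words used by `artist` and by no other artist in `corpora`."""
    counts = Counter(tokenize(corpora[artist]))
    others = set()
    for name, text in corpora.items():
        if name != artist:
            others.update(tokenize(text))
    exclusive = Counter({w: c for w, c in counts.items() if w not in others})
    return exclusive.most_common(top_n)

# Placeholder corpora with made-up text, not real lyrics.
corpora = {
    "artist_a": "zeitgeist zeitgeist money lantern money",
    "artist_b": "money cars money cars cars",
}
top = signature_words(corpora, "artist_a", top_n=2)
```

A list like this would make it easier to see whether a large vocabulary comes from rare dictionary words, portmanteaus, or transcription variants.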


The example you used for clever wordplay, "gloominati," is actually considered a portmanteau word. It's the result of combining multiple words to create a new word. (I say this not to be a pedant, but because I learned the term recently and was amused that we actually have a word for it.)


Yes, portmanteaus are exactly what I was meaning by "clever wordplay". English isn't my primary language so it's hard to remember the right word sometimes, sorry for that. :)


OP here. Do I really need to provide all of this to satisfy the reader's ability to grasp the basic premise of the site? this isn't a thesis or academic pursuit, just comparing some rappers for fun.

I used plain NLTK token analysis on rap genius lyrics. in terms of several more data slices...I agree that there should be more cuts of the data, but you must understand the amount of time that it took me to put this together.


Of course it's entirely up to you what to provide. It would be silly of me to question that. I'm not paying you for that, so how can I demand anything? I'm just saying what I think should've been done. That's how I would do it, at least.

You see, I have a strong opinion that any data-analytic work is pretty close to useless if it's not reproducible. And I mean really close. I already mentioned a few questions that naturally arise when reading your article, which are crucial to understanding your results and are not addressed by it. So, ideally, any data analysis done for the open community should provide both the full dataset and full sources. Unfortunately, it's not always possible: the dataset may be several terabytes in size, there might be legal issues about disclosure of the data (or letting everyone know exactly how the author acquired them), sources might be under NDA, whatever. In that case the description should be as pedantic as possible, because little details can change the whole meaning of the statistics entirely.

By the way, NLTK allows you to do all kinds of processing, so naming the library doesn't answer how the text was processed.

So what would I do? The usual course, actually: I would do the work in an IPython Notebook, clean it up afterwards, draw the graphs in place, and print a few of the slices I already mentioned while processing, so it would be easier to understand what the counted unique words actually look like. Fancy d3 graphs are cool for sure, but not nearly as useful.


Maybe this is just me, but it's a little unfair to compare to literary texts.

Humour me for a moment.

When an artist writes a song, he (or she) has constraints. Most rappers would like to rhyme the ends of their sentences. I know sometimes they don't (like poetry), but it's certainly pleasing to the ear to have that constraint. Artists endeavour to make their songs catchy, that's highly correlated with the gross sales of the product.

When an artist writes a novel, this constraint is not weighted quite as highly. I know Shakespeare wrote poetry, too, and to call me out on this comparison is entirely fair. That said, there's also an argument to be made for eye rhymes. Shakespeare used these a lot. Eye rhymes are words that don't rhyme aurally, but do rhyme visually. It's the story that pleases the reader, not necessarily its aural 'catchiness'. I probably made that word up. But Shakespeare made words up too. The point is, you knew what I meant.

At the end of the day these comparisons, while certainly interesting, should be taken with a pinch of salt. While I'm at it, this advice can easily be extrapolated to any dataset. Always understand there may be unknown correlations.


OP here: the shakespeare thing is really just a hook, food for thought rather than an academic/cultural judgement.

I also had several suggestions to use shakespeare's sonnets rather than plays, which I should have done.

and yes, this is all just pinch of salt barbershop discussion :)


Is Del tha Funkee Homosapien on this list? I'd be curious, since he has pretty non-standard lyrics.


The author promised Reddit that he would be added.

http://www.reddit.com/r/Music/comments/24omhw/rappers_sorted...


I'd never heard of Aesop Rock before and decided to check out some of his music. He sounds a lot like Del.


They are also on the same label. Definitive Jux has some quality product.

If you're looking for expansive vocabularies you should consider exploring other dorky rappers like Scroobius Pip.


I think I remember reading Aesop say Del influenced him a great deal. If you like Aesop, check out El-P, Cannibal Ox, Illogic, and basically anyone who was associated with Def-Jux.


Not surprised to see Wu Tang at the top and Drake at the bottom. Started from the bottom ... still there.


Haha I was thinking that as you move left on the scale the more likely you are to see rappers that people tend to mock.


This looks at the first so many lyrics in each rapper's career. Aesop Rock came out with some weird stuff right off the bat. I wonder if some of these other rappers became more sophisticated over time. Maybe an average per song, or average uniques per word, would be better.


The problem with average per song is that you "use up" words in every new song, so all things being equal each marginal song has progressively fewer new words.


I bet you could get something insightful from plotting "unique words" versus "total words" - That might give a good idea of the amount of repetition over time, the length or quantity of output, and the total vocabulary.
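
A minimal sketch of how the points for such a plot could be computed, assuming the lyrics are concatenated in release order. The function name and the toy input are made up for illustration.

```python
import re

def vocabulary_curve(lyrics: str, step: int = 1):
    """Return (total_words, unique_words) points as the text is read in order."""
    seen = set()
    points = []
    words = re.findall(r"[a-z']+", lyrics.lower())
    for i, word in enumerate(words, start=1):
        seen.add(word)
        # Record a point every `step` words, plus one at the very end.
        if i % step == 0 or i == len(words):
            points.append((i, len(seen)))
    return points

curve = vocabulary_curve("i like cats i like dogs i like cats")
```

Feeding these points to any plotting library gives one line per artist; a curve that flattens early means heavy repetition, and the height at a fixed x (say 35,000 words) is the number the chart reports.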


here's what this looks like. ugly as sin and useless for comparing rappers.

http://www.mdaniels.com/vocab/scatter.png

love your other ideas – hopefully can do them later.


Strange comment, you realise that's not an inherent truth of language? Unique words per song is trivial to calculate


If a rapper released one song using n distinct words their score would be n/1, and if they released a second song using the same set of words their score would halve, to n/2, despite the fact their demonstrated vocabulary is still n words.

In fact, if their first song used n distinct words and their second used a completely distinct set of words, but the second song was shorter than the first, their score would drop.

That would be unusual behaviour for a measure of vocabulary.
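
To make the objection concrete, here is a small sketch of that reading of the metric (total distinct words divided by number of songs). The function name and toy songs are made up for illustration.

```python
def distinct_per_song(songs):
    """Total distinct words across all songs divided by the number of songs."""
    vocabulary = set()
    for song in songs:
        vocabulary.update(song.lower().split())
    return len(vocabulary) / len(songs)

one_song = distinct_per_song(["i like cats"])
# Second song reuses the same three words: demonstrated vocabulary is unchanged.
same_words_twice = distinct_per_song(["i like cats", "cats like i"])
# Second song is entirely new words but shorter: vocabulary grows from 3 to 5.
shorter_new_song = distinct_per_song(["i like cats", "dogs bark"])
```

The score falls in both two-song cases even though the demonstrated vocabulary stayed the same in one and grew in the other.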


I don't think that's what the poster meant. By "average unique words per song" I take it to mean, within each song words are only counted once, but across songs, words can be counted multiple times. So if song A had the words "I like cats" and song B had the words "I like dogs", then the average unique word count would be ((3 + 3) / 2) = 3, not ((3 + 1)/2) = 2.
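
The two readings give different numbers for the cats/dogs example, which a short sketch makes easy to check. Function names are made up for illustration.

```python
def avg_unique_per_song(songs):
    """Average of each song's own unique-word count (words may repeat across songs)."""
    return sum(len(set(s.lower().split())) for s in songs) / len(songs)

def avg_new_words_per_song(songs):
    """Average number of words per song that did not appear in any earlier song."""
    seen, new_counts = set(), []
    for s in songs:
        words = set(s.lower().split())
        new_counts.append(len(words - seen))
        seen |= words
    return sum(new_counts) / len(songs)

songs = ["I like cats", "I like dogs"]
per_song = avg_unique_per_song(songs)     # (3 + 3) / 2
marginal = avg_new_words_per_song(songs)  # (3 + 1) / 2
```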


That's definitely one solution, but it still wouldn't quite capture it. As an extreme example, if rapper A produced 100 songs, each with exactly the same lyrics, they should surely be penalized compared with rapper B producing 100 songs with no shared words— even if rapper A's average unique-words-per-song is higher than rapper B's.
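
The extreme case is easy to simulate. In this sketch the songs are synthetic strings and the function names are made up; it only illustrates how the per-song average and the pooled vocabulary can disagree.

```python
def avg_unique_per_song(songs):
    """Average of each song's own unique-word count."""
    return sum(len(set(s.split())) for s in songs) / len(songs)

def pooled_vocabulary(songs):
    """Number of distinct words across all songs combined."""
    vocab = set()
    for s in songs:
        vocab.update(s.split())
    return len(vocab)

# Rapper A: the same five-word song released 100 times.
rapper_a = ["alpha beta gamma delta epsilon"] * 100
# Rapper B: 100 three-word songs with no words in common.
rapper_b = [f"w{i}a w{i}b w{i}c" for i in range(100)]

a_avg, a_pool = avg_unique_per_song(rapper_a), pooled_vocabulary(rapper_a)
b_avg, b_pool = avg_unique_per_song(rapper_b), pooled_vocabulary(rapper_b)
```

Rapper A wins on the per-song average while rapper B has by far the larger pooled vocabulary, which is why counting uniques over a fixed number of total words avoids the problem.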


I agree, perhaps the 35,000 most recent words would be better.


OP here: the challenge is that most artists' best work is in their earlier years. I'd rather have Jay-z's first album than last, ya know?


Would sorting by popularity, or critical acclaim, or something along those lines be a possibility?


For those who aren't familiar with Aesop Rock, I'd invite you to give him a listen sometime. His earlier albums, in particular, have been very influential to me in many ways. Both in my artistic and professional careers.

From comments on the conditions of the working man and the condition of feeling trapped in a "j-o-b"[1]:

   "Now we the American working population
   Hate the fact that eight hours a day
   Is wasted on chasing the dream of someone that isn't us
   And we may not hate our jobs
   But we hate jobs in general
   That don't have to do with fighting our own causes
   We the American working population
   Hate the nine-to-five day-in day-out
   When we'd rather be supporting ourselves
   By being paid to perfect the pastimes
   That we have harbored based solely on the fact
   That it makes us smile if it sounds dope"

To storytelling masterpieces regarding living and dreaming[2]:

   "Look, I've never had a dream in my life
   Because a dream is what you wanna do, but still haven't pursued
   I knew what I wanted and did it till it was done
   So I've been the dream that I wanted to be since day one!"

Aesop Rock takes language and linguistics to entirely different levels than one might expect from the single genre that is hip-hop. He even challenges himself and the listeners, playing fantastic word games, for instance re-using the letters L, S, and D in odd and rhythmical ways after a mention[3]:

   "Lazy summer days
   Like some decrepit landshark dumb luck squad dog lurks sicker deluded
   Last sturdy domino lean's secluded
   Don't let stupid delusions lesson super-duty labor students
   Dragnet lifer solutions
   Daddy loved sloppy dimensions like son-daughter links
   Such determinated lepers, successfully disheveled
   Little soldiers developed like serpents despite life sentence ducking
   Lemmings
   Some don't like sobriety's dirty lenses
   Some do"

And then there are just incredible gems that stick with you like[4]:

   "I don't flick needles like my sick friend
   I don't march like Beetle Bailey through a quick trend
   I don't frequent church's steeples on my weekend
   And I don't comment if you formulate a weak Zen"

There's a lot to explore from Aesop Rock. Should you find this type of hip-hop interesting, a decent place to start is with the label you can find these songs on, Definitive Jux[5]. Incredible talent has been on and off that label over the years. So much good stuff.

[1] - "9-5ers Anthem" - http://rapgenius.com/Aesop-rock-9-5ers-anthem-lyrics

[2] - "No Regrets" - http://rapgenius.com/Aesop-rock-no-regrets-lyrics

[3] - "The Greatest Pac-Man Victory in History" - http://rapgenius.com/Aesop-rock-the-greatest-pac-man-victory...

[4] - "Save Yourself" - http://rapgenius.com/Aesop-rock-save-yourself-lyrics

[5] - http://en.wikipedia.org/wiki/Definitive_Jux


I don't know man, I listened to a couple of the tracks and he definitely has lyrical skills, and I like some of the tracks, but the quotes you selected aren't very good at all, at best obvious topics with all the insight of a million college freshmen. Having said that I like "None Shall Pass" that has a really great sound.

To be entirely honest, I love rap, but not for any insight rappers have in world affairs, but for their lyrical ability. Some are very good at providing unique ways to describe their own insights about their lives but when someone starts rapping about world problems I just want to shut my brain off because it's usually pretty banal. Then with my brain off I can still at least enjoy the way the rap sounds.


> at best obvious topics with all the insight of a million college freshman

Art is weird like that.

We have to remember that it isn't all about needing to learn something new from the experience. Sometimes it's just about getting something out of it.

Looking over the lists of the best songs of all time[1], we can see that there aren't a lot of incredibly insightful songs. Quite frankly, most speak of your "obvious topics" and probably don't talk about them with any sort of magnificent linguistic grandeur.

But that doesn't mean they aren't great songs and don't offer their listener an experience worth sharing and repeating for generations.

[1] - http://en.wikipedia.org/wiki/List_of_songs_considered_the_be...


Songs - like poetry - often aren't insightful exactly, but provide a good way of communicating the emotions connected with an experience or concept.

Take "Smells Like Teen Spirit" (which appears on 6 of the lists linked to above): the lyrics are not particularly insightful (almost deliberately), but it captures the goallessness of disaffected youth like no other song.

"Nothing Compares 2 U" really is unrequited love.

Then there's U2's One (3 lists) which has lyrics that mean different things to different people, and music that can support multiple interpretations[1].

(Of course, some songs on those lists are just entertaining because they are perfect pop ("Billie Jean", "Like a Virgin") or funny ("Baby Got Back").)

[1] http://www.pophistorydig.com/?tag=u2-one-song-history


I think the point is that giving lengthy quotes of the lyrics is a little pointless if the merit in the art isn't the insight of the lyrics.


I think the term you're grasping for is "Lowest Common Denominator."


I don't disagree that there is a lot of catering to the lowest common denominator in the music industry, both in pop music and the "best songs of all time" lists.

However, I think the parent post has a point. (Now, I'm having a hard time figuring out how to effectively articulate it) The point of music isn't necessarily an insight. Listening to music is an experience, which is about how it makes you feel. Sometimes part of that is giving an insight, sometimes it isn't. Often, it is about combining an idea or concept with a performance or presentation; the idea/concept doesn't need to be insightful to be effective.


exactly. if you connect with art based on how "obvious" you find it, you are going to have a very shallow and boring art career.

sometimes being obvious is what makes it art in the first place. hell, some art needs to be obvious.


The inverse of broad appeal is the concept of The Long Tail, where there's a vast array of niche artists, that appeal to a small number of people.

https://en.wikipedia.org/wiki/Long_Tail

There's a book by the same title, written by Chris Anderson:

http://www.thelongtail.com

Prior to the internet, when advertising and media outlets were centralized, and retail businesses were distributed geographically, it was very difficult to gain a large following with niche appeal. But now that the internet has inverted the scenario, with decentralized, global exposure, and centralized market places like ebay and amazon, niche artists have a fighting chance at becoming famous within their genre.

In other words, it used to be that the only way to catch some exposure was to appeal to centralized broadcasting networks, and they only took chances on performers who were low risk. Now, with the internet, risk doesn't really matter, and mass appeal is literally measured by the size of your following. The larger your following is, by default, the more compromises you'll have made to appeal to everyone following you.

If you capture 1/2 the world as your audience, then you appeal to a broader, and more diverse audience, which has less in common with each other member of your audience, than if you managed to capture 1/4 the world. Getting half the world to agree on something, as opposed to creating something that three quarters of the world cannot relate to.

So, Aesop Rock raps about hating your boss, and many people say: "Gee, yeah, I hate my boss too! This guy's awesome!", but Kool Keith raps about Kenworths with wings, and lots of people are like: "Is he weird?" because lowest common denominator.


There isn't a doubt in my mind that if someone came along and said what the actual grand structure and meaning of reality is, most people on the Internet would dismiss it as college stoner thoughts out of hand.


I don't know, maybe it's because I didn't grow up in HackerNews social circles, but I always felt like I was the only one who thought the concept of a "dream job" was disgusting, and hated jobs in general. I think I'll probably enjoy that song even if me and Aesop aren't the only two people to feel that way.


listen to the album 'Labor Days'

If you like this abstract hip hop then I highly suggest you delve into the artist MF DOOM.


Aesop is an excellent lyricist. In fact all the MCs on the Rhymesayers label are very talented: Brother Ali, Slug (of Atmosphere) etc.

One MC whose vocabulary always leaves me taken aback is RA Scion, who has been part of the group Common Market. Their song, "My Pathology" [0] is a shining example:

    "Below the terra ferma's the murmur of many men
    Resonatin' the predication of RA's eponym
    It requires a higher degree of thought to transmit
    Elevate above the base and retrace the semantics
    Incommensurately we've been held incommunicado
    From commoner to commodore – they breed bravado
    I exercise authority over the lesser ranks
    We rally and tally up at the shores of the West Bank"

[0] http://lyrics.wikia.com/Common_Market:My_Pathology


Thanks for mentioning Slug. It'd be interesting to see where he (and Bus-Driver) came in on the vocab scale.


Slug, unsurprisingly, wouldn't be very high [0]. I've always felt like nearly every rapper he's associated with is great, but he's kind of mediocre.

I don't see any data on Busdriver, but I imagine he'd fare better.

0 - http://www.reddit.com/r/dataisbeautiful/comments/24nw9p/rapp...


I'd like to also recommend Stones Throw Records for those just getting into this realm of music.

MF DOOM is my favorite lyricist


While I wasn't familiar with the label as such (I was surprised at Madlib/Otis Jackson being featured so prominently) the recent documentary "Our Vinyl Weighs a Ton: This Is Stones Throw Records" is highly recommended for those that want to get a glance at what the label is all about.

And for those that like both poetry and rap, I suggest these two artists (in every way opposite, yet similar): the legendary Gil Scott-Heron: "We Beg your Pardon":

https://www.youtube.com/watch?v=MDCfEkopryo

And young Kate Tempest (here with Canibal Kids):

https://www.youtube.com/watch?v=TUEsihgq8zU

I'm also partial to "the Streets", RZA and a lot more "mainstream hip-hop".


thanks for the recommendations!


You're most welcome. And thank you for the thanks -- it wasn't entirely given that anyone would read them and enjoy them...

[edit: unless your profile page is meant to be a riddle/tease, I think you might want to put some actual contact info in there... Just saying (no, I don't have a start-up for you (yet anyway) :-) ]


Interesting comment about the L, S, and D usage and rhyming. I was particularly surprised by the effort that goes into Eminem's rap, which I had just attributed to "good flow". Some of that effort is explained in this video: https://www.youtube.com/watch?v=ooOL4T-BAg0


A friend of mine's theory is that Eminem has a great flow because all the vowels sound alike in his Detroit accent.


Eminem has spoken about how much he enjoys playing with words. One of the techniques he's mentioned is taking two words that don't rhyme and bending the sound of each about half way towards each other.


Easily one of my favorite artists. I'm sad they didn't include more Rhymesayers Artists. I think a lot of them would be to the right of this scale. Guys like P.O.S. and Brother Ali are also very versatile.


+1 for Brother Ali, love his back story too.


Also shocked Atmosphere is not on this list!


Oh my - thanks for the heads up on this i had not heard Atmosphere before!


Found a video rendition of aesop rock's "no regrets" pretty inspiring: https://vimeo.com/14583499

" 1-2-3, that's the speed of the seed

A-B-C, that's the speed of the need

You can dream a little dream or you can live a little dream

I'd rather live it, cause dreamers always chase but never get it"


http://m.youtube.com/watch?v=VNX4spGpIOc&feature=kp I'd recommend Aes' Zodiacpuncture for vocab speed and depth. He basically wins the ranking on that track alone.


OP: Did your analysis of MF DOOM include his work alongside Madlib as Madvillain or his various other pseudonyms (King Geedorah, Viktor Vaughn, etc.)?

I find it a little hard to believe he's not at least in the Wu Tang/Canibus/KK cluster, if not #1 overall.


Yeah I would have thought Doom would be very high. But the density of his lyrics perhaps stems more from allusions/references and humour than from the words themselves.


I can't take this list seriously until DOOM is at the top, I agree with you guys. Daniel Dumile is on his own level no doubt.


Seriously, DOOM is in his own league. At one point in "All Outta Ale", he rhymes "3-4-methylenedioxymethamphetamine" with "oxyacetaline."

Also, probably my favorite individual rhyme of all time, from "Meat Grinder": "Borderline schizo, sort of fine tits though"


Agree. I was surprised not to see DOOM as well. Another MC I think would score pretty highly is Chino XL.


I wonder where Weird Al Yankovic would come in on this ranking.


Weird Al's songs are not articulate masterpieces, but cheap parodies of other rap songs and rappers. He's probably somewhere around the 5,000 mark with the other artists.

A cursory google on the size of the average vocabulary [1] yields an interesting fact. I'm not sure how watertight it is. I realise it's probably unfair to compare the size of the average vocabulary to that of a series of songs. Songs being shorter for one. Still, it's interesting.


Not sure if that's fair to Weird Al. The people he parodies wouldn't really agree either [1]. It's not like he's doing the cheap morning show tactic of swapping clean words for dirty words or bad puns but leaving the rest of the song intact. They maintain a consistent theme which is really tough.

[1] http://www.weirdalforum.com/viewtopic.php?t=5673


Yeah, perhaps I was a little harsh. Don't get me wrong, I like Weird Al. I probably could have phrased my comment a little better. I should have said something along the lines of, "In my experience listening to Weird Al, it doesn't feel like he explores a lot of the English language."

Your link is cool, thanks for sharing that.


Makes me very happy to see Aesop Rock in the #1 spot. He isn't as underground as many people assume: he is still relatively unknown in the mainstream, but well known enough to sell records and sell out shows. I wasn't a big fan of his 2012 release Skelethon, but the way he structures his lyrics and the meaning behind them means he never writes a bad lyric.

Interestingly, Eminem, whom I would have thought would rank pretty highly for his clever word bending and enunciation, is only in the middle of the scale. Still a whole lot better than some of his counterparts, but surprising. Another interesting thing to note is that Eminem is grouped in the same league as the likes of Jay-Z, Rakim and Lupe Fiasco, with only a couple of hundred unique words separating them from one another.


I always thought Eminem was famous for his clever wordplay, not his vocabulary diversity. FWIW, as a non-native speaker I can gather most of his verses. Aesop Rock, on the other hand, is totally indecipherable for me without printed lyrics.


I find it hilarious that DMX is dead last.

I've now got empirical evidence of what I always thought.

I think DMX rhymes words with themselves more than any rapper I've ever heard.


I'm pretty sure this fails to take into account DMX's rich canine vocabulary.


I think Rick Ross would give DMX a run for his money. I've heard him rhyme a word with the same word before (Atlantic).


I said out loud before clicking the link that DMX would likely be dead last.


X gon' Give it to ya? Nope.


Sampling bias.


This is a great graph, but I think it would be neat if a y-axis was thrown in. My first thought was album sales or some other metric of popularity that helps you find specific rappers quickly instead of going through the huge bunch of little pics.


This reminds me of a PyCon talk from this year on analyzing rap lyrics with some basic NLP techniques:

http://pyvideo.org/video/2658/analyzing-rap-lyrics-with-pyth...

The author was trying to see whether rappers could be considered more hateful towards women based on their "bitch per song" usage. The results are quite interesting.


Lil Jon should be at the bottom with 7 words: "Yeah!", "Okay!", "Shots!" and "Turn down for what?"


He's not, because you forgot: "WHAAAAAT?", "SKIT", "SKIT" and "SKIT!".


This infographic doesn't take into account other rappers possibly copying earlier really influential artists, making the earlier influential artists rank lower. More generally, it would be cool to see this chart ranked by the amount of original words present in the first 35,000 lyrics that were not present yet at the albums' time of publication.


To put some perspective on this:

  ryan@3G08:~/Desktop/bleh$ pdftotext David-Foster-Wallace-Infinite-Jest-v2.0.pdf
  ryan@3G08:~/Desktop/bleh$ python dfw.py
  size of vocabulary: 30725

The man passed Shakespeare by 1,896 words with that book.

code:

  import string

  import nltk
  from nltk.stem import PorterStemmer

  # Read the plain text produced by pdftotext
  with open("/home/ryan/Desktop/bleh/David-Foster-Wallace-Infinite-Jest-v2.0.txt") as f:
      raw = f.read()

  # Strip punctuation and lowercase so "Word," and "word" are the same token
  exclude = set(string.punctuation)
  raw = ''.join(ch for ch in raw if ch not in exclude)
  raw = raw.lower()

  tokens = nltk.word_tokenize(raw)

  # Stem every token and collect the distinct stems
  stemmer = PorterStemmer()
  stemmed_tokens = set()
  for token in tokens:
      stemmed_tokens.add(stemmer.stem(token))

  print("size of vocabulary:", len(stemmed_tokens))


I've been wanting to do some NLP on Rap Genius's corpus for ages. This is a great analysis. What I had thought of is writing a program to detect ghostwriting. Rappers probably have some sort of lyrical 'DNA' in the construction of their verses. How often they use certain words, number of words per line, number of unique words per song, ratio of adjectives to nouns, that kind of thing. You could probably unmask some ghost-writing secrets.

Looking at the analysis here, it's interesting to see some clustering in the results. IMO the second cluster is the sweet spot: Wu Tang's excessive invention of vocabulary is cool but probably detracts from the poetic effect. Meanwhile rappers like 2Pac are just kind of boring IMO, at least going by their lyrics alone.
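A rough sketch of what such a lyrical fingerprint could look like in Python. The function name `lyric_fingerprint` and the choice of features are assumptions made up for illustration; it only covers the simple counts mentioned above (word frequencies, words per line, unique words per song) and makes no claim about how well these would actually separate one writer from another.

```python
from collections import Counter


def lyric_fingerprint(lyrics, top_n=5):
    """Compute a few simple style features for one song (hypothetical helper)."""
    # Keep non-empty lines only, lowercased and split on whitespace
    lines = [line.lower().split() for line in lyrics.splitlines() if line.strip()]
    words = [w for line in lines for w in line]
    counts = Counter(words)
    return {
        "total_words": len(words),
        "unique_words": len(counts),
        "words_per_line": len(words) / len(lines) if lines else 0.0,
        "top_words": counts.most_common(top_n),
    }


song = "you can dream a little dream\nor you can live a little dream"
print(lyric_fingerprint(song))
```

Comparing these dictionaries across songs credited to the same artist would be the next step; part-of-speech ratios would need a tagger and are left out here.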


I'm a big fan of the project and the way it is presented. Not sure why Wu-Tang features so prominently but I guess I'm okay with that. Kool Keith should be broken down further into his constituent parts. I also would have thought the Beastie Boys would have run higher.


I'm pretty sure Kool Keith would not be okay with that.



OP here: many thanks!


I would have been rather surprised not to see Aesop Rock fairly high up the list. I was reading the Rap Genius pages for a few of his tracks the other week and the sheer density of wordplay was fairly overwhelming.

It is rap for geeks though ;)


author here: hit me up with questions you've got.


Any chance you would release your code for this? I'd love to run an analysis on some lesser known rappers and play around with some of the filters. Awesome project btw.

EDIT: The reason I ask is that I assume you don't have the time or desire to add every rapper every person asks you for.


the code is pretty straightforward, just chapter one of the NLTK book and a txt file.

nltk.org/book
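This is not the author's code, just a minimal sketch of what that kind of count could look like without NLTK. The function name `unique_words_in_first` is a placeholder, and tokenization here is assumed to be a plain whitespace split with no stemming, so spelling variants such as "hanging" and "hangin" stay separate words.

```python
def unique_words_in_first(text, limit=35000):
    """Count distinct lowercase tokens among the first `limit` tokens."""
    tokens = text.lower().split()[:limit]
    return len(set(tokens))


# Tiny stand-in for an artist's lyrics file, concatenated in release order
sample = "Yeah yeah okay okay shots shots"
print(unique_words_in_first(sample))
```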


Before making this study, what were your predictions? Would you have expected Wu-Tang and GZA to be near the top? What did you expect the average to be?

It would be very interesting to do something similar for rockers too.


I only have hip hop data, sadly.

I didn't expect wu-tang to be on the top since it would go across 11 different vocabs. I felt like the law of averages would basically play out.


an artist that meets the requirements for the data, AZ. He's got 5+ albums, some of which were gold and it'd be interesting as I believe he may also be the highest selling solo artist in the top 10 aside from Ghostface


Did you remove hooks/choruses from the artist's 35,000 word count?


nope. they are in there. if it's on the rap genius lyrics page, I included it.


Awesome work. Did you build a scraper to grab lyrics from Rap Genius?


What are Jay Z's stats?

[Edit] also Notorious B.I.G. :)


Jay Z's stats are there: he's at 4,506. Pretty middle of the pack.

About Biggie: "35,000 words covers 3-5 studio albums and EPs. I included mixtapes if the artist was just short of the 35,000 words. Quite a few rappers don’t have enough official material to be included (e.g., Biggie, Kendrick Lamar)"

And to respond to your child comment, I'm sure the same problem (not enough material) applied to Big L.


Also Big L


Greatly enjoyed the analysis but while I was reading it I felt a lot like this guy:

https://www.youtube.com/watch?v=GKlDBi0cyIA


All the rappers listed seem to be American.

Whack this through your Bowers and Wilkins:

https://www.youtube.com/watch?v=p_SQEUZomug


I think the only problem I see is that some rap groups are listed as rappers. For instance Beastie Boys, De La Soul and Wu-Tang are listed. So there is some collective vocabulary being compared to single rappers. That said, this is cool and pretty telling. From what I could see it is probably loosely coupled to the intelligence of the rappers listed. I will echo the sentiments about DMX here. Looks like some shock jock rappers definitely are low on the list (Too Short).


> "Wu-Tang Clan at #6 is fucking impressive given that 10 members, with vastly different styles, are equally contributing lyrics. Add the fact that GZA, Ghostface, Raekwon, and Method Man's solo works are also in the top 20 – notably, GZA at #2. Perhaps their countless hours of studio time together (and RZA’s mentorship) exposed each rapper’s vocabulary to one another."

At least in the case of the Wu-Tang Clan, this seems to be done on purpose and suggests that there is a strong correspondence between the individual members' repertoire on one hand and the group's vocabulary as a whole, with a presumed exchange in both directions (i.e. both as a deductive and an inductive process).


This is an interesting analysis.

I love the fact that E-40 is about on par with Shakespeare. I'm sure he would take it as a compliment to be called the modern day Shakespeare.


author here: thanks!


"Each word is counted once, so pimps, pimp, pimping, and pimpin are four unique words"

So much for the modern Shakespeares on the list.


unlike shakespeare who was so high class and never made 'your mom' jokes or used any toilet humour or anything like that.



or jokes about "nothing".


As well as the old Shakespeare who was ranked according to the same rules.


I haven't read all his work, but I guess Shakespeare didn't use "hangn" and "hangin" as an alternative to "hanging". The author could validate the words against a dictionary, but it would still be flawed due to conjugations being counted as different words.


What about conjugations like hang'd vs hanged to fit the meter?


I keep getting this error, in Firefox and Chrome:

<Error> <Code>AccessDenied</Code> <Message>Access Denied</Message> <RequestId>3CB1F41D7DFDC794</RequestId> <HostId> wHCPzEYPDsmkMJX+YIgjU40YPrGYytHrk5B44dApi7663NkQQI0RKx9A/6EX7Iph </HostId> </Error>


Do you have https everywhere? I kept getting this error with HTTPS Everywhere. You need to turn off the rule for AWS.


hm. it should be up. try it again.


How about a 2D visualization with a sliding 10,000 word window, with the y axis as unique words out of 10k and the x axis as time? Are there cultural trends that are time dependent? Did Young MC and Del use more words than contemporary artists? Did their trends as artists follow the global trend over time?
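A minimal sketch of the sliding-window count, assuming the lyrics are already tokenized into a list in chronological order so that the window's start index can stand in for time. The function name and the `step` parameter are made up for illustration.

```python
def sliding_unique_counts(tokens, window=10000, step=1000):
    """Return (start_index, unique_count) for each window position."""
    results = []
    # No windows are produced if the text is shorter than one window
    for start in range(0, len(tokens) - window + 1, step):
        chunk = tokens[start:start + window]
        results.append((start, len(set(chunk))))
    return results


tokens = "a b a c c c".split()
print(sliding_unique_counts(tokens, window=3, step=1))
```

Plotting the second element against the first (or against release dates, if each token carried one) would give the y and x axes described above.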


Maybe this will help me answer that nagging question at the back of my brain: What does DJ Khaled actually _do_?


Would be interesting to see how they compare to rock bands like Titus Andronicus, Fucked Up or Bad Religion.


I wonder where things like classic rock / broadway musicals / opera / etc. fit on this spectrum.

I really appreciate including Shakespeare and Moby Dick on the spectrum, but I'd still like some more perspective. For that matter, I wonder how many unique words I use every day.


Just a note, those artists don't necessarily use all their vocabulary. Eminem for example clearly holds back on his vocabulary. Rap is as much an art as anything can be so there are all sorts of factors. Be careful about what conclusions you draw here beyond curiosity.


> Eminem for example clearly holds back on his vocabulary.

What makes you say this?


He would spend hours studying the dictionary.


I'm sure the answer to this question is "He looks more articulate than the average rapper."


Gibes regarding racism aside, some people seem more articulate due to the fact that they carry themselves differently during interviews.

Of course, many musicians will keep a persona going during interviews as well, so it's still not a very reliable metric.

The most extreme example I've seen was Marilyn Manson, but there are plenty of musicians who rap / sing about really inane stuff and then show that they're way smarter than the way they present themselves with their music.


> there are plenty of musicians who rap / sing about really inane stuff and then show that they're way smarter than the way they present themselves with their music.

As mentioned in the article, Jay-Z even raps about doing just that.


Cool to see Canibus so high in the rankings.

It'd also be cool to add the members of AOTP to the analysis.


I would love to see this analysis without filters. Who is the rapper with the largest vocabulary? What does the distribution look like at the top? Surely Antipop Consortium or MF DOOM have larger vocabularies than Aesop for instance.


MF DOOM is on the list. He's above average, but well below Aes or most of Wu-Tang. I've listened to a fair number of rappers, and I was pretty confident Aes would be at the top of this list.

I agree it would be cool to see a list of all rappers, though. I was surprised not to see Del, and maybe there is some more obscure rapper I'm not thinking of with a broader vocabulary than Aes.


I'm pretty sure E-40 scored so high because of all the made up words. He's highly regarded for being innovative and influential but you know for every piece of slang that stuck there's like ten that didn't.


Really surprised MF Doom is not ranked higher – are his side projects included?


Why is Jedi Mind Tricks not counted? He'd be the first in this list: https://www.youtube.com/watch?v=TlZgiK6FiO0


Jedi Mind Tricks is not one person. It's two rappers (Vinnie Paz and Jus Allah).


Jus Allah has been in and out of JMT. http://en.wikipedia.org/wiki/Jedi_Mind_Tricks


It doesn't matter, they are still rappers and I bet they'd be one of the top


You're right. And I agree, they most likely would.

Army of the Pharaohs would be up there as well.


Not particularly surprised at the list. Aesop Rock, the whole Wu-tang Clan, and guys like Nas, Wale, all near the top. DMX and Too Short at the bottom...

Definitely comes out in their music...


  > Definitely comes out in their music...
That's right, Too $hort's music is laser-focused!


How many words in "fo shizzle ma nizzle" ? 4 or 0 ?


4!


I would love the same chart but sorted by vulgarity.


I would love to see the same analysis across different music styles. How do the vocabulary sizes of Madonna, Bob Dylan and Justin Bieber compare?


Yeah, I'd like to see where The Clash rank.


I only have rap data, sadly :(


Matt, I can very easily give you a corpus for other artists, I'll send you an email.


I would like to see Dälek included in the study. I'd be surprised if they didn't show up on the far right on the scale.


What I would like to see, is this same comparison done against album sales with the implication of mainstream vs. underground.


I kept this focused on vocab so that the data viz was very straightforward and easy to digest/draw insights from. I've had many requests for album sales to be added, and I plan to as soon as possible :)


Was it mentioned where the data was sourced from? I'm not seeing anything and I went back and checked. Did I miss it?


> All lyrics are provided by Rap Genius, but are only current to 2012. My lack of recent data prevented me from using quite a few current artists.


Killah Priest should be grouped with Wu-Tang.


OP: he's an associate, not a member


This is awesome. Reminds me of all the data viz they are doing on rapgenius. You forgot Atmosphere though (Slug)


I'd be interested in how Nerdcore rappers compare to this, such as MC Frontalot or Professor Elemental.


Have you checked out MC Paul Barman?


Not yet, thanks for the hint.


Couldn't find Aceyalone - I thought he'd be in the top 10, I guess he wasn't included.


Top 10? I would put money on him being at least #2 and giving Aesop Rock a run for his money for #1 (depending on which albums his 35,000 words fall on - he's got a solid discography). I just put on A Book of Human Language (https://www.youtube.com/watch?v=lwVNp42l3Xo - full album or https://www.youtube.com/watch?v=GnsCO0Fxw3A - solid song selection). Along with that, Wu Tang as the only crew analyzed? How about Freestyle Fellowship (Aceyalone's crew), Quannum Collective, Jurassic 5?

I'd also like to see Mos Def on the list, along with everyone from Quannum, and the Soulsides and SoundBombing volumes.

Also, (unique words : total words) might be an interesting scoring method, and would allow comparison over entire works regardless of their individual volumes of output. Or choosing a random sample of # of words as opposed to first # of words, as someone who started publishing as a young buck may take a hit for early immaturity.
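Both scoring ideas are easy to try. Below is a small sketch, with hypothetical function names, of the unique-to-total ratio and of counting unique words in a random sample instead of the first N words. One caveat worth hedging: the ratio tends to fall as a text gets longer, because common words keep repeating, so ratios computed over very different total lengths may not be directly comparable.

```python
import random


def type_token_ratio(tokens):
    """Unique words divided by total words (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def unique_in_random_sample(tokens, sample_size, seed=0):
    """Unique words in a random sample of tokens rather than the first N."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    k = min(sample_size, len(tokens))
    return len(set(rng.sample(tokens, k)))


tokens = "you can dream a little dream or you can live a little dream".split()
print(type_token_ratio(tokens))
print(unique_in_random_sample(tokens, sample_size=5))
```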


Mos def is in there


thanks, must have missed him


Aceyalone would be a great addition. I'd also like to see Chali 2na and the whole Jurassic5 crew.


No mention of MF Doom? Metalface? Doom? Viktor Vaughn? (All the same gentleman from LA)


MF DOOM is on there pretty much right on the Shakespeare line. It's not clear if it includes his work as Danger / Villain / etc.


Actually he is from UK


So funny comparing this to the same graph they did for pop lyricists.


I wasn't surprised to see Canibus and Outkast up there.


Awesome. This guy should definitely work for RapGenius.com.


author here: the data is straight from RG, so I have been working with them :)


Incredibly, a list about rapper vocabulary is missing anyone associated with nerdcore.

I'm interested to see where the likes of MC Frontalot, Wordburglar, YTCracker, etc. rank on that scale...


The author addresses the "Why is $RAPPER not on the list?" on the larger reddit threads:

http://www.reddit.com/r/hiphopheads/comments/24neym/rappers_...

http://www.reddit.com/r/dataisbeautiful/comments/24nw9p/rapp...

My gut feeling is that the nerdcore guys would be decidedly middle-of-the-road, but I'd love to see the result!


author here: indeed. thanks for adding this to the thread.


I don't see why novelty acts would be relevant.


I'm not really sure that the other acts totally lack artistic integrity, but I'm pretty sure that ytcracker isn't a novelty act.

Some of them aren't serious and have the overdeveloped sense of irony that you expect from people kidding around, but I'm not sure that that makes them novelty music any more than, say, The Electric Six are.


Why do you dismiss nerdcore as "novelty"?


because most of the artists are gimmick acts.


Downvoters, explain please?


I'd love to see Immortal Technique also.


Thank you based god!!


Shouldn't that be adjusted to the size of the text corpus?


The text corpus was normalized to the first 35k words for each rapper.


This might be the best-made infographic I've ever seen.


author here: oh man. flattered.


Woah so awesome!


where is KRS-ONE?


4585 unique words.


I'd really like to see this broken down by established vocabulary and made up vocabulary. I think that would really start to show who were the best lyricists on both ends. Rappers with a lot of made up words might be on the far left, and rappers with a lot of unique words that aren't made up would be on the far right. Both sides of the scale would show rapping talent on different dimensions. Influential rappers like E-40 who add new words to the vocabulary, and wordy rappers like Aes on the right who use a really dense and descriptive vocabulary.


>Rappers with a lot of made up words might be on the far left

It's interesting that you think that. I'd recommend you look at this: http://www.shakespeare-online.com/biography/wordsinvented.ht...

At what point do 'made up' words become 'established'? After they've been published? If so, every made up word in a song should be considered 'established'.


Somebody was listening to "To the Best of Our Knowledge" this weekend, methinks.


I don't know what that is, but I'll check it out.


They happened to have a bit on Shakespeare's word invention: http://www.ttbook.org/book/shakespeares-inflence-stephen-mar...

The claim is that some 1,700 or so of his 30,000 word vocabulary were invented by him, including terms such as "assassination" and "gnarled".


I didn't mean to imply that being on the far left or being made up meant 'worse' in any way. Quite the opposite in fact - I think that if someone can create a word or phrase that becomes normal in everyday usage, it's a sign of... something. Maybe influence, maybe genius. Like you said, Shakespeare first used words that we consider common.

I have no idea at what point a word becomes 'established' though.


I actually revised the conclusion to address this issue. Check it out on the site.


OP here. made up vocabulary is basically all of hip hop. look at any lyrics file and very little is "established", yaknowwhatimsayin?


Would also be nice to see size of the corpus controlled for. What's the rate at which they use new words?


This chart already does that. The corpus for each artist is their first 35,000 words of lyrics.


EVERY word is a "made-up word". Some words were just made up longer ago than others. If you held everyone to this criterion you would severely impoverish language.


Kool Keith should be exempt from this list. He's not from any of the 4 regions listed, but from Jupiter.


For those who didn't get the reference:

> Dr. Octagon is a persona created and used by American rapper Keith Matthew Thornton, better known as Kool Keith...Dr. Octagon is an extraterrestrial time traveling gynecologist and surgeon from the planet Jupiter. [1]

1. http://en.wikipedia.org/wiki/Dr._Octagon


I checked for him immediately once I saw this chart. Between Doc Oc, Dr. Dooom and Black Elvis, he has some insanely weird lyrics.

  We stuck together when one of my parakeets died
  You broke down and cried, for the love of animals
  I used to always cut the legs off a roach
  See if he'll stay there on a piece of tissue
  And give him a piece of toast
  That morning, he would wake up and be gone
  What, the insect had a ambulance?


I had never heard this particular Kool Keith rap before but as soon as I read the lyrics I could hear his very distinctive delivery pattern. I listened to the track on YouTube and was not surprised it was just as I had imagined. How distinctive.


From the song 'Earth People'

" First patient, pull out the skull, remove the cancer Breaking his back, chisel necks for the answer Supersonic bionic robot voodoo power Equator ex my chance to flex skills on Ampex With power meters and heaters gauge anti-freeze

Octagon oxygen, aluminum intoxicants More ways to blow blood cells in your face React with four bombs and six fire missiles Armed with seven rounds of space doo-doo pistols

You may not believe, living on the Earth planet My skin is green and silver, warhead looking mean

Astronauts get played, tough like the ukelele As I move in rockets, overriding, levels Nothing's aware, same data, same system"

[...] Hook x4

Earth People, New York and California Earth People, I was born on Jupiter.

yup, Jupiter indeed. God I love this song.


Best line comes next though: "With my helmet on, you can't tell me I'm not in space."


it's all good!


And, does his Dr. Octagon work count as Kool Keith or its own thing? Is his work in Ultramagnetic MCs counted or just solo stuff?


You have moose bumps.


Gotta wonder about the garbage-in factor of Rap Genius. From one randomly selected Aesop Rock cut:

"Please I want to donate my brain to the monstrous Panasonic profit"

I guess it could be. I always heard it as "monstrous Panasonic prophet." It would be in keeping with the previous lyric "Television, all hail grand pixelated god of fantasy."


In that instance my guess is that he probably used a homonym on purpose. Perhaps another interesting breakdown would account (and give extra points) for homonyms since it's sort of 2 words.


There are some other howlers in the same song, like "canope" instead of "canopy" and "intervenes" instead of "intravenous".


fixed.

This song has no credited transcriber, so it was one of the originals added to "bootstrap" the site, and hence is of a lot lower quality than the later songs added and cleaned up by users.


Profit makes more sense to me, because it ties into the directly preceding line "Please take me, please calm me, please make me a zombie". Aes is clearly talking about advertising and (broadly) media here, and while the double meaning of "prophet" does contribute to the line, the surface meaning is that he's donating his brain (attention) to the Panasonic profit (from advertising).

However, that said, that page (http://rapgenius.com/Aesop-rock-basic-cable-lyrics) does need some love. You should help out! Use the "suggestions" box on individual annotations to leave things for Editors (trusted members of the community, like myself) to integrate into the existing annotations.


It's obviously both "profit" and "prophet." Neither one is wrong.


While neither is really wrong, you've still got to choose one to put on the lyrics page. I believe I outlined my reasons for thinking one is the surface meaning and one is a double meaning.


I can see it either way, but this isn't quantum mechanics. It cannot in fact be both words at once.


While not rap, St. Vincent[1] may want to have some words with you.

Ambiguity can definitely be used in art.

[1] https://www.youtube.com/watch?v=whxF0KhBR84


Of course it can be both. If he deliberately chose it because it is a homophone and intended for either interpretation to work, then it's both.


Actually, that's a hugely common thing to do in rap: make a word ambiguous between two meanings to add some more depth to the song.


since i use only rap genius data, i'm hoping the garbage all cancels out so that the rappers are still relatively in the right place.


We might all be self-confessed hackers, but we'll never explicitly confess our adoration for the gloriousness of the genre that is gangster rap.


This comment is nothing more than a cheap attempt to inject musical elitism. Please, if you have nothing critical to say that doesn't also inspire well-meaning debate, then take your business to Slashdot.


What do you mean never explicitly confess? The fuck?

Gangster rap is awesome as well as most of hip hop. But as the chart shows CunninLynguists rules all


The estimate of vocabulary size here is based on the number of unique words used. This seems like it is strongly biased: if two artists have the same size vocabulary, but one has released more albums and thus used more words, that artist will probably have used more unique words. To underscore this point, the number of unique words used by Aesop Rock is half of the estimated vocabulary size of the average college student, although to be fair that estimate is the number of words that an individual can recognize, not the number of words they use. (Edit: the bias is somewhat mitigated by the fact that the same number of words is used to estimate the vocabulary for each artist, but the bias is not dependent on sample size alone but also upon the size of the artist's underlying vocabulary; see my comments below.)

The underlying problem is one of estimating the cardinality of a multinomial distribution given a fixed number of samples. In isolation this problem is ill-posed, since it is always possible that there is a word in a given lyricist's vocabulary that he uses with very low frequency and that is unlikely to appear in any sample, but with appropriate prior information it may be possible to obtain an accurate estimate.

This is not my field, but a brief Google Scholar search shows that there are several papers on estimating vocabulary size, or equivalently, estimating the number of species based on sampling. There is a somewhat dated review (http://cvcl.mit.edu/SUNSeminar/BungeFitzpatrick_1993.pdf) that details some methods of estimation (in this case, I believe we are in the domain of "infinite population, multinomial sample" with unequal class sizes). The paper notes that there is no unbiased estimator available without assumptions on the distribution of word use frequencies, but some of the proposed estimators may be more accurate than the naive estimate used here.
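To make the species-estimation idea concrete, here is a small sketch of one classic estimator, Chao1, which corrects the observed count using how many words were seen exactly once and exactly twice. Chao1 is picked here only because it is simple; this is not a claim that it is the estimator the review recommends, and the function name is hypothetical. It is usually described as a lower bound, so it can still undershoot a large vocabulary.

```python
from collections import Counter


def chao1_estimate(tokens):
    """Chao1 estimate of total vocabulary size from a sample of tokens."""
    counts = Counter(tokens)
    observed = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)  # seen once
    doubletons = sum(1 for c in counts.values() if c == 2)  # seen twice
    if doubletons > 0:
        return observed + singletons ** 2 / (2 * doubletons)
    # Bias-corrected form, used when no word appears exactly twice
    return observed + singletons * (singletons - 1) / 2


print(chao1_estimate("a a b b c d".split()))
```

Many singletons relative to doubletons push the estimate well above the observed count, which matches the intuition that a sample full of one-off words has not come close to exhausting the writer's vocabulary.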


This is why it's "number of unique words used within the artist's first 35,000 lyrics." Sample size is held constant. (Except, maybe, those who haven't yet written 35k words?)


Sample size is already controlled for. See the second paragraph, immediately before the graphic: "I used each artist’s first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake."


Are you looking at a different link than us? The intro reads "I decided to compare this data point against the most famous artists in hip hop. I used each artist’s first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake." and the title of the infographic (if that's all you looked at) reads "# of Unique words used within artist’s first 35,000 lyrics"

This seems to address your concern completely?


Yes, I missed that, even though it is very clearly spelled out (oops!). It makes the ordinal comparison valid (modulo noise), but it does not completely address the concern. If you have two artists, and artist 1 uses 5,000 unique words in 35,000 lyrics while artist 2 uses 10,000 unique words in 35,000 lyrics, artist 2's vocabulary may be substantially more than twice as large as artist 1's. It is unlikely that a lyricist exhausts their entire vocabulary in such a small sample, particularly if their vocabulary is large and contains many words that they use infrequently. http://www.jstor.org.libproxy.mit.edu/stable/2284147 has a correction that can be applied, although even there the author notes that, when applied to James Joyce's Portrait of an Artist, their technique appears to greatly underestimate Joyce's total vocabulary.


This is a very good point - Aesop Rock, for example, uses one unique word every 5 words (7k unique in 35k), and if this does not stop, maybe he would continue and we would find the same average in 70k or 120k words. After all, you still have to have filler words like "to", "a", "the", "have", etc - he could be saturating the spots where he can put uniques.

So this could substantially underrepresent vocabularies. There are only so many unique words you can put in a sentence. As an extreme, if we looked at the first hundred words of every rapper, we would not find a hundred unique words in any of them (due to repeats of grammatically common words), even though, clearly, all rappers have a vocabulary over a hundred.

I wonder if this is a fatal flaw? How can we estimate where the distortion stops? (For example, if someone uses 1000 words in their first 35,000, intuitively this seems to imply to me that's most of their stock. But if someone uses 5,000 in 35,000 - that is not so clear at all.)


The paper I linked to in my previous comment uses Zipf's law (briefly, the frequency of word use is inversely proportional to its rank; more at http://en.wikipedia.org/wiki/Zipf%27s_law) to estimate the "distortion." This should produce a better estimate than the naive method, but there are still problems: the plot on the Wikipedia page shows that Zipf's law is not a particularly good fit to word frequency for Wikipedia past the ~10,000th word, and it's not clear that rap music represents a typical natural language corpus. It is probably still possible to devise a correction if one knows how word use frequencies are distributed.

A second related problem that that paper touches on toward the end is that sequential words from the same text are not independent samples from an author's vocabulary. Two artists may have the same vocabulary, but if one artist uses more non-sequiturs, fewer articles, fewer repeated phrases, or generally tries to use more unique words within a given song, then that artist will come out ahead in the measure used here. I'm not sure how much of a problem this really is for comparing lyrics between artists (depending on what is of interest, it may actually be desirable), but it may explain the poor showings for Shakespeare and Melville, since prose is likely to repeat words more frequently than rap lyrics for reasons unrelated to the authors' vocabularies. (FWIW, even conservative estimates put Shakespeare's vocabulary at >15,000 words, which would be hard to measure in a sample of 35,000 words.)
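A toy simulation of the Zipf point, under strong assumptions that real lyrics violate: words are drawn independently, and word frequency is exactly proportional to 1/rank. The function name and the vocabulary sizes are made up for illustration. It draws 35,000 words from two vocabularies, one four times the size of the other, and counts the distinct words that show up.

```python
import random


def observed_unique(vocab_size, sample_size=35000, seed=0):
    """Draw `sample_size` words from a Zipf-distributed vocabulary of
    `vocab_size` words and return how many distinct words appear."""
    rng = random.Random(seed)
    ranks = list(range(1, vocab_size + 1))
    weights = [1.0 / r for r in ranks]  # Zipf: frequency ~ 1 / rank
    sample = rng.choices(ranks, weights=weights, k=sample_size)
    return len(set(sample))


for size in (5000, 20000):
    print(size, observed_unique(size))
```

In a setup like this the gap in observed unique words should come out smaller than the fourfold gap in true vocabulary, which is the sense in which the magnitudes on the chart can mislead even when the ordering holds.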


OP here. vocab is non-linear:

mdaniels.com/vocab/scatter.png

so the average would change as your sample size grows.

35K was the threshold where things didn't get garbled (like the 100 word example that you mention).

The threshold is also impacted by who I include. If I went to 50K, I'd lose out on rappers like Drake.
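For anyone who wants to see the non-linearity on their own text, a minimal sketch of a vocabulary growth curve might look like the following. This is not OP's actual code; the tokenizer is a crude assumption (lowercase, letters and apostrophes only), and `vocab_growth` is a hypothetical helper name.

```python
import re


def tokenize(text):
    """Very rough tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def vocab_growth(tokens, step):
    """Return (tokens_seen, unique_words_seen) pairs, one every `step`
    tokens plus a final point at the end of the text."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve
```

Plotting the second number against the first for each artist should give a curve that flattens as the sample grows, which is why the cutoff you pick matters.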


Repetition also seems like a huge issue given a 35,000 lyric limit. If someone repeats the same line 5-7 times it's hardly reasonable to count that when estimating vocabulary.

Edit: A quick check found 7 repetitions of the same 8 word phrase in a DMX song, which I chose because he was at the bottom of the list. http://www.azlyrics.com/lyrics/dmx/whatthesebitcheswant.html
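One way to check how much repeated lines eat into the word budget is to count with and without exact duplicate lines. This is only a sketch: it assumes the lyrics arrive as a list of lines, splits on whitespace without stripping punctuation, and `unique_words` is a hypothetical helper, not part of the original analysis.

```python
def unique_words(lines, drop_repeated_lines=False):
    """Return (words_counted, unique_words) for a list of lyric lines.

    With drop_repeated_lines=True, a line that exactly repeats an earlier
    one (ignoring case and extra spaces) does not use up any of the budget.
    """
    seen_lines = set()
    vocab = set()
    words_counted = 0
    for line in lines:
        key = " ".join(line.lower().split())
        if not key:
            continue  # skip blank lines
        if drop_repeated_lines and key in seen_lines:
            continue  # repeated hook or chorus line
        seen_lines.add(key)
        words = key.split()
        words_counted += len(words)
        vocab.update(words)
    return words_counted, len(vocab)
```

Comparing the two results for the same artist shows how many of the 35,000 words are spent on repeats that cannot add new vocabulary.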


This is an issue that I was hoping would cancel out, given the fact that I use the same analysis for every rapper. In short, it's an exact unique word count, but it's relationally accurate.


The problem you cite only exists if you explicitly want to estimate the underlying vocabulary of the writer. However, as a description of this particular corpus the vocabulary sizes are perfectly valid and exact rather than an estimate.


If we are really only interested in the number of unique words in the first 35,000 lyrics each of these artists has produced, and not in what that number says about the artists themselves or how it generalizes to the rest of their body of work, then yes, the analysis is exact and perfect. I don't think that's really the goal, though. We are interested in drawing inferences about the artists and their work. As I say above, the rankings are correct modulo noise, but the magnitudes of the differences could be pretty far off. (There is noise unless we wouldn't find it meaningful that these numbers could differ between 35,000-word samples from the same artist, and unless it is impossible for the first 35,000 words to differ for reasons unrelated to what we are trying to measure.)
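A crude way to get a feel for that noise, assuming you have more than one window's worth of tokens for an artist, is to count unique words in several non-overlapping windows of the same size and look at the spread. `unique_per_window` is a hypothetical helper, and this is a sketch, not a proper variance estimate.

```python
def unique_per_window(tokens, window):
    """Unique-word counts for consecutive, non-overlapping windows of
    `window` tokens. A trailing partial window is ignored."""
    counts = []
    for start in range(0, len(tokens) - window + 1, window):
        counts.append(len(set(tokens[start:start + window])))
    return counts
```

If the counts for one artist vary by more than the gap between two artists in the chart, the difference between those two artists is probably not meaningful.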



