I enjoyed reading this chart but I hope it doesn't reinforce the bias that some fans have that word complexity is the only way to tell if a rapper is good or not. There are several ways to judge the strengths and weaknesses of a rapper. Complexity is one of them, flow is another. Storytelling ability is another very strong indicator. The best rappers are able to bring a mix, while some are so strong in one area that they explode even if they are really weak in other areas.
To fully understand rap, we must first be fluent with its meter, rhyme, and figures of speech. Then ask two questions: One, how artfully has the objective of the song been rendered, and two, how important is that objective. Question one rates the song's perfection, question two rates its importance. And once these questions have been answered, determining a song's greatness becomes a relatively simple matter.
If the song's score for perfection is plotted along the horizontal of a graph, and its importance is plotted on the vertical, then calculating the total area of the song yields the measure of its greatness.
It's kind of like putting book authors in this kind of a list. Using more words and more complex ones doesn't really say anything about the quality of the writing. Pretty interesting data-set for fun though.
William Faulkner once criticised Ernest Hemingway by saying that he "had never been known to use a word that might send the reader to the dictionary."
Hemingway responded by saying, "Poor Faulkner. Does he really think big emotions come from big words? He thinks I don't know the ten-dollar words. I know them all right. But there are older and simpler and better words, and those are the ones I use."
This is fascinating. I'm only a recent listener of hip-hop (primarily because of Earl Sweatshirt and Odd Future) and I'm in awe of the vernacular.
And similarly, as a boredom exercise a few weeks ago I did some lexical analysis of the song Timber (the monstrosity was being constantly played on the radio at the time) and here's what I came out with:
"83.1% of the words in the lyrics are five letters or less, 58.9% are four letters or less. The lexical density (the number of unique words divided by the total number of words, multiplied by one-hundred) is 29.1%. There is only one word in the song which has three or more syllables. Eleven people were involved with the writing of the song, each of them capable of producing just nine unique words each."
My last sentence was intended as a satire, lyrics are obviously not uniformly distributed between writers. I completely agree with you though, the song (despite my disdain for it) is incredibly catchy, and definitely not intended to be thoughtful or thought provoking in nature.
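For anyone who wants to repeat this kind of count, here is a minimal sketch of the two measures quoted above: lexical density as defined in the parentheses (unique words divided by total words, times one hundred) and the share of words at or below a given length. The function names and the regex tokenizer are my own assumptions, not what was actually used for the Timber numbers, and a different tokenizer will shift the percentages somewhat.

```python
import re


def words(text):
    """Lowercase the text and pull out word tokens (inner apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def lexical_density(text):
    """Unique words divided by total words, times one hundred."""
    tokens = words(text)
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


def share_at_most(text, max_letters):
    """Percentage of words with at most max_letters letters (apostrophes not counted)."""
    tokens = words(text)
    if not tokens:
        return 0.0
    short = sum(1 for t in tokens if len(t.replace("'", "")) <= max_letters)
    return 100.0 * short / len(tokens)


# Made-up sample line, not real lyrics.
sample = "the cat sat on the mat and the dog sat too"
print(round(lexical_density(sample), 1))
print(round(share_at_most(sample, 3), 1))
```

Stop words like 'the' are left in here; filtering them out first would be a one-line change to `words` and would raise the density figure.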
Is that with or without 'the', 'be', 'to'? Not that it's any sort of literary accomplishment regardless, but English is a terrible language for lexical density.
You must be using big words in case someone does the analysis on HN comments.
Definitions:
Vernacular: ??
Lexical analysis (in this case): ratio of unique words to non unique words
Lexical density: What percent of the words are unique?
The paragraph in quotes is copied and pasted from when I wrote that a few weeks ago, there's a definition in parentheses following lexical density. Analysis is a word that should not need a definition attached to it (you have used it yourself in your comment.)
Vernacular is commonly used in the United Kingdom. Google will provide you with a definition.
Looked for Canibus near the top and wasn't surprised to find him 4th. If anyone hasn't heard of him, highly suggest listening to his older stuff such as his first Can-I-Bus, 2000 BC and Mic Club.
He raps about science and space all the time which is cool.
Many here seem to be interpreting vocabulary size as a signal for quality. When it comes to rap I completely disagree. Firstly, the repetition is rap's main ingredient. I read an article a while ago where researchers found that listening to a spoken phrase that is looped activates the same part of the brain as music, which helps explain this phenomenon.
Personally, if I want food for thought I read. Rap is not an intellectual pursuit. I've been perusing rappers on this list, and the top artists have not been good at all to my ears. It seems that the best rappers are in the middle, and being on either extreme is a negative signal.
> It seems that the best rappers are in the middle, and being on either extreme is a negative signal.
Objectively, I don't think that is an accurate assessment. There are highly respected artists all across the chart, including the extremes. For instance, Wu-Tang Clan and its members at the upper extreme and Kanye, Snoop and Tupac at the lower extreme.
Yeah I agree, that statement was definitely a stretch. There are well respected rappers across the spectrum, which is my main point: that the interpretation of the chart as a proxy for quality is wrong.
> Firstly, the repetition is rap's main ingredient.
Sure, within one song. But if an artist has a narrow vocabulary throughout their career, it's a sign that they're just writing the same song over and over.
> Shakespeare’s vocabulary: across his entire corpus, he uses 28,829 words, suggesting he knew over 100,000 words
Why does that suggest he knew over 100k words? Maybe it means he knew 28,829 and used all of them? Would he really know over 70,000 words he never used in his works? What would those 70,000 words be? Probably very obscure ones. How can you know that many obscure ones?
Vocabulary has for a while been considered in terms of 'receptive' and 'productive' capacity, with the assumption being that ones 'receptive' vocabulary can be larger, since it is easier to hear/read/understand a word than it is to use it correctly in reading/writing (this is not necessarily the popular opinion anymore [http://www.readingconnect.net/web/FILES/english-language-and...] but may provide the context for the claim about Shakespeare). The notion is that you are able to understand more words than you commonly use in your speech/writing, which is on some level intuitive, although of course it is an empirical question.
I imagine that it would be something similar to the German Tank Problem. (http://en.wikipedia.org/wiki/German_tank_problem ) Taking each writing as a sample of the words that are known then would allow for an estimation of total words known. I imagine that this would need to be modified to account for the non-uniform distribution of word use, but the principle would be the same.
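For reference, a minimal sketch of the classic German tank estimator, which guesses a population size N from k distinct serial numbers drawn without replacement, with m the largest one observed: N ≈ m + m/k - 1. The function name is an assumption. This only illustrates the principle; words carry no serial numbers and are not used uniformly, so it cannot be applied to a vocabulary as written.

```python
def german_tank_estimate(serials):
    """Estimate the population size from distinct serial numbers
    sampled without replacement: m + m/k - 1."""
    k = len(serials)
    if k == 0:
        raise ValueError("need at least one observation")
    m = max(serials)
    return m + m / k - 1


# Four observed serial numbers, largest is 60: 60 + 60/4 - 1
print(german_tank_estimate([19, 40, 42, 60]))  # 74.0
```

Estimating an unseen vocabulary from observed text would need a method that handles the skewed frequency of word use, as the comment above already suspects.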
The 28K figure is achieved by counting multiple spellings of the same word. Shakespeare lived before dictionaries, so there was never a single standard way to spell a word.
I'm curious if words that Shakespeare invented count. There are many words that we see first used by Shakespeare, though some of them were probably words invented during his time by others with him merely being the first to record (in documents that survived until today).
The latter is probably more the case. The OED, for instance, has had a bias in favor of using Shakespeare as a word's origin since that dictionary's first edition. The number of words attributed to Shakespeare in the OED has dwindled over time.
Just like you or I know tens of thousands of words, but only use some small subset of them in any given work, you wouldn't expect that Shakespeare would use his entire productive vernacular in producing the limited corpus of his literary works.
It's a nice touch including portmanteaus and 'incorrect' ebonics on the list (like "ery'day"), since authors like Shakespeare, Joyce and others took the same liberties with language. Arguably, that's how language develops, and it's what makes it interesting to study and think about. The OP could have easily stuck to words in the OED, kudos.
Really interesting, but not as representative as it should be. It's not clear why some have larger vocabulary than others. It could be using words like "zeitgeist" (in case of Aesop Rock) or some clever wordplay (I don't know much about hip-hop, so I can't find example for some artist from the list right off the bat, but I remember Marilyn Manson using word "gloominati" for instance) or pretty meaningless made up words like "schizzle" (in case of Snoop Dogg) or usual derivatives like "fuckedy fuck". Moreover, in many transcripts for hip-hop people write down words as they are pronounced, which can be pretty much distorted for some artists (which of course ideally shouldn't count as a "new word", but that's complicated, yeah).
While Aesop Rock and DMX are clearly extreme and not surprising at all, it's not that clear for some guys in the middle.
So, first off, for every data project sources should be provided, or at least more specific definition, how text was processed, tokenized, analyzed. Second, several more "data slices" should be provided, for instance 100 most used words which are unique for that artist compared to other artist in the list.
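To make the second suggestion concrete, here is a rough sketch of the "most used words unique to one artist" slice. Everything in it is an assumption for illustration: the function name, the whitespace tokenization, and the toy lyrics under the placeholder artist names "a" and "b". It says nothing about how the original chart was computed.

```python
from collections import Counter


def distinctive_words(lyrics_by_artist, artist, top_n=100):
    """Most frequent words used by `artist` that no other artist in the dict uses."""
    own = Counter(lyrics_by_artist[artist].lower().split())

    # Pool every word used by any other artist.
    others = set()
    for name, text in lyrics_by_artist.items():
        if name != artist:
            others.update(text.lower().split())

    # Keep only the words the other artists never use, with their counts.
    only_own = Counter({w: c for w, c in own.items() if w not in others})
    return only_own.most_common(top_n)


# Toy data, not real lyrics.
lyrics = {
    "a": "zeitgeist zeitgeist money lunar",
    "b": "money cars money",
}
print(distinctive_words(lyrics, "a", top_n=2))
```

A slice like this would also surface the transcription problem mentioned above, since phonetic spellings would show up as "unique" words for whichever artist's lyrics were transcribed that way.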
The example you used for clever wordplay, "gloominati," is actually considered a portmanteau word. It's the result of combining multiple words to create a new word. (I say this not to be a pedant, but because I learned the term recently and was amused that we actually have a word for it.)
Yes, portmanteaus are exactly what I was meaning by "clever wordplay". English isn't my primary language so it's hard to remember the right word sometimes, sorry for that. :)
OP here. Do I really need to provide all of this to satisfy the reader's ability to grasp the basic premise of the site? This isn't a thesis or academic pursuit, just comparing some rappers for fun.
I used plain NLTK token analysis on Rap Genius lyrics. In terms of several more data slices... I agree that there should be more cuts of the data, but you must understand the amount of time that it took me to put this together.
Of course it's entirely up to you what to provide. It would be silly of me to question that. I'm not paying you for that, so how can I demand anything? I'm just saying what I think should've been done. That's how I would do it, at least.
You see, I have a strong opinion on that any data-analytic work is pretty close to being useless if it's not reproducible. And I mean really close. I already mentioned few questions that naturally arise reading your article which are crucial to understand your results and are not addressed by the article. So, ideally, any data analysis done for the open community should provide both full dataset and full sources. Unfortunately, it's not always possible: dataset may be several terabytes long, there might be legal issues about disclosure of the data (or letting everyone know how exactly author acquired them), sources might be under NDA, whatever. In this case description should be as pedantic as possible, because some little details can change the whole meaning of statistics entirely.
By the way, NLTK allows you to do all kinds of processing, so it isn't the answer.
So what would I do? The usual course, actually: do the work in an IPython Notebook, clean it up afterwards, draw the graphs in place, and print a few of the slices I mentioned while processing, so it would be easier to understand what the counted unique words actually look like. Fancy d3 graphs are cool for sure, but not nearly as useful.
Maybe this is just me, but it's a little unfair to compare to literary texts.
Humour me for a moment.
When an artist writes a song, he (or she) has constraints. Most rappers would like to rhyme the ends of their sentences. I know sometimes they don't (like poetry), but it's certainly pleasing to the ear to have that constraint. Artists endeavour to make their songs catchy, that's highly correlated with the gross sales of the product.
When an artist writes a novel, this constraint is not weighted quite as highly. I know Shakespeare wrote poetry, too, and to call me out on this comparison is entirely fair. That said, there's also an argument to be made for eye rhymes. Shakespeare used these a lot. Eye rhymes are words that don't rhyme aurally, but do rhyme visually. It's the story that pleases the reader, not necessarily its aural 'catchiness'. I probably made that word up. But Shakespeare made words up too. The point is, you knew what I meant.
At the end of the day these comparisons, while certainly interesting, should be taken with a pinch of salt. While I'm at it, this advice can easily be extrapolated to any dataset. Always understand there may be unknown correlations.
I think I remember reading Aesop say Del influenced him a great deal. If you like Aesop, check out El-P, Cannibal Ox, Illogic, and basically anyone who was associated with Def-Jux.
This looks at the first so many lyrics in each rapper's career. Aesop Rock came out with some weird stuff right off the bat. I wonder if some of these other rappers became more sophisticated over time. Maybe an average per song, or average uniques per word, would be better.
The problem with average per song is that you "use up" words in every new song, so all things being equal each marginal song has progressively fewer new words.
I bet you could get something insightful from plotting "unique words" versus "total words" - That might give a good idea of the amount of repetition over time, the length or quantity of output, and the total vocabulary.
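A minimal sketch of the data behind such a plot, assuming the lyrics have already been split into a chronological list of tokens (the function name is made up for illustration). Each point pairs the running total of words with the number of distinct words seen so far, so a flat stretch means repetition and a steep stretch means new vocabulary.

```python
def vocabulary_curve(tokens):
    """Return (total words so far, unique words so far) after each token."""
    seen = set()
    curve = []
    for total, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((total, len(seen)))
    return curve


# Toy input, not real lyrics.
for point in vocabulary_curve("i like cats i like dogs".split()):
    print(point)
```

Feeding the two columns to any plotting tool gives one curve per artist, and reading the curve at a fixed number of total words gives a single cutoff-style figure.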
If a rapper released one song using n distinct words their score would be n/1, and if they released a second song using the same set of words their score would halve, to n/2, despite the fact their demonstrated vocabulary is still n words.
In fact, if their first song used n distinct words and their second used a completely distinct set of words, but the second song was shorter than the first, their score would drop.
That would be unusual behaviour for a measure of vocabulary.
I don't think that's what the poster meant. By "average unique words per song" I take it to mean, within each song words are only counted once, but across songs, words can be counted multiple times. So if song A had the words "I like cats" and song B had the words "I like dogs", then the average unique word count would be ((3 + 3) / 2) = 3, not ((3 + 1)/2) = 2.
That's definitely one solution, but it still wouldn't quite capture it. As an extreme example, if rapper A produced 100 songs, each with exactly the same lyrics, they should surely be penalized compared with rapper B producing 100 songs with no shared words— even if rapper A's average unique-words-per-song is higher than rapper B's.
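The three readings discussed in this sub-thread can be compared side by side. This is only a sketch with made-up function names and whitespace tokenization: the pooled count divided by the number of songs (the n/1 versus n/2 reading), the mean of the per-song unique counts (the "I like cats" reading), and the plain size of the pooled vocabulary.

```python
def song_words(song):
    """Set of lowercase words in one song."""
    return set(song.lower().split())


def pooled_unique_per_song(songs):
    """Distinct words across all songs, divided by the number of songs."""
    return len(total_vocabulary(songs)) / len(songs)


def mean_unique_per_song(songs):
    """Average of each song's own unique word count."""
    return sum(len(song_words(s)) for s in songs) / len(songs)


def total_vocabulary(songs):
    """All distinct words used across the songs."""
    vocab = set()
    for s in songs:
        vocab |= song_words(s)
    return vocab


songs = ["I like cats", "I like dogs"]
print(pooled_unique_per_song(songs))  # 2.0, i.e. (3 + 1) / 2
print(mean_unique_per_song(songs))    # 3.0, i.e. (3 + 3) / 2
print(len(total_vocabulary(songs)))   # 4

# The extreme case: the same song released 100 times.
repeats = ["I like cats"] * 100
print(mean_unique_per_song(repeats), len(total_vocabulary(repeats)))  # 3.0 3
```

The last line shows the objection above: the per-song average stays at 3 no matter how many times the song is repeated, while the pooled vocabulary never grows past 3 words.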
For those who aren't familiar with Aesop Rock, I'd invite you to give him a listen sometime. His earlier albums, in particular, have been very influential to me in many ways. Both in my artistic and professional careers.
From comments on the conditions of the working man and the condition of feeling trapped in a "j-o-b"[1]:
"Now we the American working population
Hate the fact that eight hours a day
Is wasted on chasing the dream of someone that isn't us
And we may not hate our jobs
But we hate jobs in general
That don't have to do with fighting our own causes
We the American working population
Hate the nine-to-five day-in day-out
When we'd rather be supporting ourselves
By being paid to perfect the pastimes
That we have harbored based solely on the fact
That it makes us smile if it sounds dope"
To storytelling masterpieces regarding living and dreaming[2]:
"Look, I've never had a dream in my life
Because a dream is what you wanna do, but still haven't pursued
I knew what I wanted and did it till it was done
So I've been the dream that I wanted to be since day one!"
Aesop Rock takes language and linguistics to entirely different levels than one might expect from the single genre that is hip-hop. He even challenges himself and the listeners, playing fantastic word games, for instance re-using the letters L, S, and D in odd and rhythmical ways after a mention[3]:
"Lazy summer days
Like some decrepit landshark dumb luck squad dog lurks sicker deluded
Last sturdy domino lean's secluded
Don't let stupid delusions lesson super-duty labor students
Dragnet lifer solutions
Daddy loved sloppy dimensions like son-daughter links
Such determinated lepers, successfully disheveled
Little soldiers developed like serpents despite life sentence ducking
Lemmings
Some don't like sobriety's dirty lenses
Some do"
And then there are just incredible gems that stick with you like[4]:
"I don't flick needles like my sick friend
I don't march like Beetle Bailey through a quick trend
I don't frequent church's steeples on my weekend
And I don't comment if you formulate a weak Zen"
There's a lot to explore from Aesop Rock. Should you find this type of hip-hop interesting, a decent place to start is with the label you can find these songs on, Definitive Jux[5]. Incredible talent has been on and off that label over the years. So much good stuff.
I don't know man, I listened to a couple of the tracks and he definitely has lyrical skills, and I like some of the tracks, but the quotes you selected aren't very good at all, at best obvious topics with all the insight of a million college freshmen. Having said that I like "None Shall Pass" that has a really great sound.
To be entirely honest, I love rap, but not for any insight rappers have in world affairs, but for their lyrical ability. Some are very good at providing unique ways to describe their own insights about their lives but when someone starts rapping about world problems I just want to shut my brain off because it's usually pretty banal. Then with my brain off I can still at least enjoy the way the rap sounds.
> at best obvious topics with all the insight of a million college freshman
Art is weird like that.
We have to remember that it isn't all about needing to learn something new from the experience. Sometimes it's just about getting something out of it.
Looking over the lists of the best songs of all time[1], we can see that there aren't a lot of incredibly insightful songs. Quite frankly, most speak of your "obvious topics" and probably don't talk about them with any sort of magnificent linguistic grandeur.
But that doesn't mean they aren't great songs and don't offer their listener an experience worth sharing and repeating for generations.
Songs - like poetry - often aren't insightful exactly, but provide a good way of communicating the emotions connected with an experience or concept.
Take "Smells Like Teen Spirit" (which appears on 6 of the lists linked to above): the lyrics are not particularly insightful (almost deliberately), but it captures the goallessness of disaffected youth like no other song.
"Nothing Compares 2 U" really is unrequited love.
Then there's U2's One (3 lists) which has lyrics that mean different things to different people, and music that can support multiple interpretations[1].
(Of course, some songs on those lists are just entertaining because they are perfect pop ("Billie Jean", "Like a Virgin") or funny ("Baby Got Back").)
I don't disagree that there is a lot of catering to the lowest common denominator in the music industry, both in pop music and the "best songs of all time" lists.
However, I think the parent post has a point. (Now, I'm having a hard time figuring out how to effectively articulate it) The point of music isn't necessarily an insight. Listening to music is an experience, which is about how it makes you feel. Sometimes part of that is giving an insight, sometimes it isn't. Often, it is about combining an idea or concept with a performance or presentation; the idea/concept doesn't need to be insightful to be effective.
Prior to the internet, when advertising and media outlets were centralized, and retail businesses were distributed geographically, it was very difficult to gain a large following with niche appeal. But now that the internet has inverted the scenario, with decentralized, global exposure, and centralized market places like ebay and amazon, niche artists have a fighting chance at becoming famous within their genre.
In other words, it used to be that the only way to catch some exposure was to appeal to centralized broadcasting networks, and they only took chances on performers who were low risk. Now, with the internet, risk doesn't really matter, and mass appeal is literally measured by the size of your following. The larger your following is, by default, the more compromises you'll have made to appeal to everyone following you.
If you capture 1/2 the world as your audience, then you appeal to a broader, and more diverse audience, which has less in common with each other member of your audience, than if you managed to capture 1/4 the world. Getting half the world to agree on something, as opposed to creating something that three quarters of the world cannot relate to.
So, Aesop Rock raps about hating your boss, and many people say: "Gee, yeah, I hate my boss too! This guy's awesome!", but Kool Keith raps about Kenworths with wings, and lots of people are like: "Is he weird?" because lowest common denominator.
There isn't a doubt in my mind that if someone came along and said what the actual grand structure and meaning of reality is, most people on the Internet would dismiss it as college stoner thoughts out of hand.
I don't know, maybe it's because I didn't grow up in HackerNews social circles, but I always felt like I was the only one who thought the concept of a "dream job" was disgusting, and hated jobs in general. I think I'll probably enjoy that song even if Aesop and I aren't the only two people to feel that way.
Aesop is an excellent lyricist. In fact all the MCs on the Rhymesayers label are very talented: Brother Ali, Slug (of Atmosphere) etc.
One MC whose vocabulary always leaves me taken aback is RA Scion, who has been part of the group Common Market. Their song, "My Pathology" [0] is a shining example:
"Below the terra ferma's the murmur of many men
Resonatin' the predication of RA's eponym
It requires a higher degree of thought to transmit
Elevate above the base and retrace the semantics
Incommensurately we've been held incommunicado
From commoner to commodore – they breed bravado
I exercise authority over the lesser ranks
We rally and tally up at the shores of the West Bank"
While I wasn't familiar with the label as such (I was surprised at Madlib/Otis Jackson being featured so prominently) the recent documentary "Our Vinyl Weighs a Ton: This Is Stones Throw Records" is highly recommended for those that want to get a glance at what the label is all about.
And for those that like both poetry and rap, I suggest these two artists (in every way opposite, yet similar): the legendary Gil Scott-Heron: "We Beg your Pardon":
You're most welcome. And thank you for the thanks -- it wasn't entirely given that anyone would read them and enjoy them...
[edit: unless your profile page is meant to be a riddle/tease, I think you might want to put some actual contact info in there... Just saying (no, I don't have a start-up for you (yet anyway) :-) ]
Interesting comment about the L, S, and D usage and rhyming. I was particularly surprised by the effort that goes into Eminem's rap, which I had just attributed to "good flow". Some of that effort is explained in this video: https://www.youtube.com/watch?v=ooOL4T-BAg0
Eminem has spoken about how much he enjoys playing with words. One of the techniques he's mentioned is taking two words that don't rhyme and bending the sound of each about half way towards each other.
Easily one of my favorite artists. I'm sad they didn't include more Rhymesayers Artists. I think a lot of them would be to the right of this scale. Guys like P.O.S. and Brother Ali are also very versatile.
Yeah, I would have thought Doom would be very high. But the density of his lyrics perhaps stems more from allusions/references and humour than from the words themselves.
Weird Al's songs are not articulate masterpieces, but cheap parodies of other rap songs and rappers. He's probably somewhere around the 5,000 mark with the other artists.
A cursory google on the size of the average vocabulary [1] yields an interesting fact. I'm not sure how watertight it is. I realise it's probably unfair to compare the size of the average vocabulary to that of a series of songs. Songs being shorter for one. Still, it's interesting.
Not sure if that's fair to Weird Al. The people he parodies wouldn't really agree either [1]. It's not like he's doing the cheap morning show tactic of swapping clean words for dirty words or bad puns but leaving the rest of the song intact. They maintain a consistent theme which is really tough.
Yeah, perhaps I was a little harsh. Don't get me wrong, I like Weird Al. I probably could have phrased my comment a little better. I should have said something along the lines of, "In my experience listening to Weird Al, it doesn't feel like he explores a lot of the English language."
Makes me very happy to see Aesop Rock in the #1 spot. He isn't as underground as many people assume: still relatively unknown in the mainstream, but well known enough to sell records and sell out shows. I wasn't a big fan of his 2012 release Skelethon, but the way he structures his lyrics and the meaning behind them means he never writes a bad lyric.
Interestingly Eminem whom I would have thought would rank pretty highly for his clever method of word bending and enunciation is only in the middle of the scale. Still a whole lot better than some of his counterparts, but still surprising. Another interesting thing to note is Eminem being grouped in the same league as the likes of Jay-Z, Rakim and Lupe Fiasco. With only a couple of hundred unique words separating them from one another.
I always thought Eminem was famous for his clever wordplay, not his vocabulary diversity. FWIW, as a non-native speaker I can gather most of his verses. Aesop Rock, on the other hand, is totally indecipherable for me without printed lyrics.
This is a great graph, but I think it would be neat if a y-axis was thrown in. My first thought was album sales or some other metric of popularity that help you find specific rappers quick instead of going through the huge bunch of little pics.
The author was trying to see if rappers are considered more hateful towards women by their usage of "bitch per song". The results are quite interesting.
This infographic doesn't take into account other rappers possibly copying earlier really influential artists, making the earlier influential artists rank lower. More generally, it would be cool to see this chart ranked by the amount of original words present in the first 35,000 lyrics that were not present yet at the albums' time of publication.
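A rough sketch of that "first use" ranking, under loud assumptions: the releases are given as (year, artist, lyrics) tuples, words are split on whitespace, and a word is credited to whichever artist used it in the earliest release. The function name and the toy data are invented for illustration, and the 35,000-word cutoff is ignored here.

```python
def first_use_counts(releases):
    """Count, per artist, the words that no earlier release had used.

    `releases` is an iterable of (year, artist, lyrics) tuples.
    """
    seen = set()
    credit = {}
    # Walk the releases in chronological order.
    for year, artist, lyrics in sorted(releases, key=lambda r: r[0]):
        words = set(lyrics.lower().split())
        new_words = words - seen
        credit[artist] = credit.get(artist, 0) + len(new_words)
        seen |= words
    return credit


# Toy data with placeholder artists "a" and "b", not real lyrics.
releases = [
    (1995, "b", "money cars"),
    (1990, "a", "money lunar"),
    (1998, "a", "cars zeitgeist"),
]
print(first_use_counts(releases))
```

Releases from the same year keep their input order, so ties would need a finer date to be credited fairly.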
To put some perspective on this:
ryan@3G08:~/Desktop/bleh$ pdftotext David-Foster-Wallace-Infinite-Jest-v2.0.pdf
ryan@3G08:~/Desktop/bleh$ python dfw.py
size of vocabulary: 30725
The man passed Shakespeare by 1,896 words with that book.
code:
import string
import nltk
from nltk.stem import PorterStemmer

# Read the plain text produced by pdftotext
with open("/home/ryan/Desktop/bleh/David-Foster-Wallace-Infinite-Jest-v2.0.txt") as f:
    raw = f.read()

# Strip punctuation and lowercase everything before tokenizing
exclude = set(string.punctuation)
raw = ''.join(ch for ch in raw if ch not in exclude).lower()
tokens = nltk.word_tokenize(raw)

# Stem each token so inflected forms count once; the set drops duplicates
stemmer = PorterStemmer()
stemmed_tokens = {stemmer.stem(token) for token in tokens}
print("size of vocabulary:", len(stemmed_tokens))
I've been wanting to do some NLP on rap genius's corpus for ages. This is a great analysis. What I had thought of is write a program to detect ghostwriting. Rappers probably have some sort of lyrical 'DNA' in the construction of their verses. How often they use certain words, number of words per line, number of unique words per song, ratio of adjectives to nouns, that kind of thing. You could probably unmask some ghost-writing secrets.
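As a starting point, a few of those features can be computed with nothing but the standard library. This is a sketch under assumptions: the function name, the whitespace tokenization, and the choice of marker words are all hypothetical, and the adjective-to-noun ratio is left out because it needs a part-of-speech tagger.

```python
from collections import Counter


def style_features(lyrics, marker_words=("the", "i", "you")):
    """Compute a small stylistic fingerprint for one song or verse."""
    # One list of lowercase words per non-empty line.
    lines = [ln.split() for ln in lyrics.lower().splitlines() if ln.strip()]
    tokens = [t for ln in lines for t in ln]
    if not tokens:
        return {}

    counts = Counter(tokens)
    features = {
        "words_per_line": len(tokens) / len(lines),
        "unique_ratio": len(counts) / len(tokens),
    }
    # Relative frequency of each marker word.
    for word in marker_words:
        features["freq_" + word] = counts[word] / len(tokens)
    return features


# Toy input, not real lyrics.
print(style_features("I like cats\nI like the dogs too"))
```

Comparing these vectors between an artist's known verses and a disputed one would be the next step, though with so few features it would only hint at authorship rather than prove anything.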
Looking at the analysis here, it's interesting to see some clustering in the results. IMO the second cluster is the sweet spot: Wu Tang's excessive invention of vocabulary is cool but probably detracts from the poetic effect. Meanwhile rappers like 2Pac are just kind of boring IMO, at least going by their lyrics alone.
I'm a big fan of the project and the way it is presented. Not sure why Wu-Tang features so prominently but I guess I'm okay with that. Kool Keith should be broken down further into his constituent parts. I also would have thought the Beastie Boys would have run higher.
I would have been rather surprised not to see Aesop Rock fairly high up the list. I was reading the Rap Genius pages for a few of his tracks the other week and the sheer density of wordplay was fairly overwhelming.
Any chance you would release your code for this? I'd love to run an analysis on some lesser known rappers and play around with some of the filters. Awesome project btw.
EDIT: The reason I ask is that I assume you don't have the time or desire to add every rapper every person asks you for.
Before making this study, what were your predictions? Would you have expected Wu-Tang and GZA to be near the top?
What did you expect the average to be?
It would be very interesting to do something similar for rockers too.
an artist that meets the requirements for the data, AZ. He's got 5+ albums, some of which were gold and it'd be interesting as I believe he may also be the highest selling solo artist in the top 10 aside from Ghostface
Jay Z's stats are there: he's at 4,506. Pretty middle of the pack.
About Biggie:"35,000 words covers 3-5 studio albums and EPs. I included mixtapes if the artist was just short of the 35,000 words. Quite a few rappers don’t have enough official material to be included (e.g., Biggie, Kendrick Lamar)"
And to respond to your child comment, I'm sure the same problem (not enough material) applied to Big L.
I think the only problem I see is that some rap groups are listed as rappers. For instance, Beastie Boys, De La Soul and Wu-Tang are listed. So there is some collective vocabulary being compared to single rappers. That said, this is cool and pretty telling. From what I could see it is probably loosely coupled to the intelligence of the rappers listed. I will echo the sentiments about DMX here. Looks like some shock jock rappers definitely are low on the list (Too Short).
> "Wu-Tang Clan at #6 is fucking impressive given that 10 members, with vastly different styles, are equally contributing lyrics. Add the fact that GZA, Ghostface, Raekwon, and Method Man's solo works are also in the top 20 – notably, GZA at #2. Perhaps their countless hours of studio time together (and RZA’s mentorship) exposed each rapper’s vocabulary to one another."
At least in the case of the Wu-Tang Clan, this seems to be done on purpose and suggests that there is a strong correspondence between the individual members' repertoire on one hand and the group's vocabulary as a whole, with a presumed exchange in both direction (i.e. both as a deductive and an inductive process).
I haven't read all his work, but I guess Shakespeare didn't use "hangn" and "hangin" as an alternative to "hanging".
The author could validate the words against a dictionary, but it would still be flawed due to conjugations being counted as different words.
How about a 2D visualization with a sliding 10,000-word window, with the y axis as unique words out of 10k and the x axis as time? Are there cultural trends that are time dependent? Did Young MC and Del use more words than contemporary artists? Did their trends as artists follow the global trend over time?
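The sliding-window count itself is cheap to compute. Below is a minimal sketch, assuming the lyrics are already a chronological list of tokens; the function name is made up, and the window size is a parameter so the 10,000 figure is just the default.

```python
from collections import Counter


def sliding_unique_counts(tokens, window=10000):
    """Number of distinct words in each full window of `window` consecutive tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    counts = Counter()
    result = []
    for i, token in enumerate(tokens):
        counts[token] += 1
        # Drop the token that just slid out of the window.
        if i >= window:
            old = tokens[i - window]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        # Record a value once the window is full.
        if i >= window - 1:
            result.append(len(counts))
    return result


# Toy input with a tiny window so the result is easy to check by hand.
print(sliding_unique_counts("a b a c c d".split(), window=3))  # [2, 3, 2, 2]
```

Each value lines up with the position of the last token in its window, so mapping token positions to release dates would give the time axis.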
I wonder where things like classic rock / broadway musicals / opera / etc. fits on this spectrum.
I really appreciate including Shakespeare and Moby Dick on the spectrum, but I'd still like some more perspective. For that matter, I wonder how many unique words I use every day.
Just a note, those artists don't necessarily use all their vocabulary. Eminem, for example, clearly holds back on his vocabulary. Rap is as much an art as anything can be, so there are all sorts of factors. Be careful about what you draw from this beyond curiosity.
Gibes regarding racism aside, some people seem more articulate due to the fact that they carry themselves differently during interviews.
Of course, many musicians will keep a persona going during interviews as well, so it's still not a very reliable metric.
The most extreme example I've seen was Marilyn Manson, but there are plenty of musicians who rap / sing about really inane stuff and then show that they're way smarter than the way they present themselves with their music.
> there are plenty of musicians who rap / sing about really inane stuff and then show that they're way smarter than the way they present themselves with their music.
As mentioned in the article, Jay-Z even raps about doing just that.
I would love to see this analysis without filters. Who is the rapper with the largest vocabulary? What does the distribution look like at the top? Surely Antipop Consortium or MF DOOM have larger vocabularies than Aesop for instance.
MF DOOM is on the list. He's above average, but well below Aes or most of Wu-Tang. I've listened to a fair number of rappers, and I was pretty confident Aes would be at the top of this list.
I agree it would be cool to see a list of all rappers, though. I was surprised not to see Del, and maybe there is some more obscure rapper I'm not thinking of with a broader vocabulary than Aes.
I'm pretty sure E-40 scored so high because of all the made up words. He's highly regarded for being innovative and influential but you know for every piece of slang that stuck there's like ten that didn't.
Not particularly surprised at the list. Aesop Rock, the whole Wu-Tang Clan, and guys like Nas and Wale, all near the top. DMX and Too Short at the bottom...
I kept this focused on vocab so that the data viz was very straightforward and easy to digest/draw insights from. I've had many requests for album sales to be added, and I plan to as soon as possible :)
Top 10? I would put money on him being at least #2 and giving Aesop Rock a run for his money for #1 (depending on which albums his 35,000 words fall on - he's got a solid discography). I just put on A Book of Human Language (https://www.youtube.com/watch?v=lwVNp42l3Xo - full album, or https://www.youtube.com/watch?v=GnsCO0Fxw3A - solid song selection). Along with that, Wu-Tang as the only crew analyzed? How about Freestyle Fellowship (Aceyalone's crew), Quannum Collective, Jurassic 5?
I'd also like to see Mos Def on the list, along with everyone from Quannum, and the Soulsides and SoundBombing volumes.
Also, (unique words : total words) might be an interesting scoring method, and would allow comparison over entire works regardless of their individual volumes of output. Or choosing a random sample of # of words as opposed to first # of words, as someone who started publishing as a young buck may take a hit for early immaturity.
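Both suggestions are easy to sketch. This is only an illustration with made-up function names, assuming `words` is a flat list of tokens for one artist. One caveat worth keeping in mind: the unique-to-total ratio tends to fall as a text gets longer, so it is not fully independent of output volume either.

```python
import random

def type_token_ratio(words):
    """Unique words divided by total words (0.0 for an empty list)."""
    return len(set(words)) / len(words) if words else 0.0

def unique_in_random_sample(words, k, seed=0):
    """Count unique words among k tokens drawn at random, without replacement,
    from the whole discography instead of taking the first k tokens."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    sample = rng.sample(words, min(k, len(words)))
    return len(set(sample))
```

Sampling individual tokens breaks up songs; sampling whole songs until the word budget is reached would be a reasonable alternative.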
I'm not really sure that the other acts totally lack artistic integrity, but I'm pretty sure that ytcracker isn't a novelty act.
Some of them aren't serious and have the overdeveloped sense of irony that you expect from people kidding around, but I'm not sure that that makes them novelty music any more than, say, The Electric Six are.
I'd really like to see this broken down by established vocabulary and made up vocabulary. I think that would really start to show who were the best lyricists on both ends. Rappers with a lot of made up words might be on the far left, and rappers with a lot of unique words that aren't made up would be on the far right. Both sides of the scale would show rapping talent on different dimensions. Influential rappers like E-40 who add new words to the vocabulary, and wordy rappers like Aes on the right who use a really dense and descriptive vocabulary.
At what point do 'made up' words become 'established'? After they've been published? If so, every made up word in a song should be considered 'established'.
I didn't mean to imply that being on the far left or being made up meant 'worse' in any way. Quite the opposite in fact - I think that if someone can create a word or phrase that becomes normal in everyday usages, it's a sign of... something. Maybe influence, maybe genius. Like you said, Shakespeare first used words that we consider common.
I have no idea at what point a word becomes 'established' though.
EVERY word is a "made-up word". Some words were just made up longer ago than others. If you held everyone to this criterion you would severely impoverish language.
> Dr. Octagon is a persona created and used by American rapper Keith Matthew Thornton, better known as Kool Keith...Dr. Octagon is an extraterrestrial time traveling gynecologist and surgeon from the planet Jupiter. [1]
I checked for him immediately once I saw this chart. Between Doc Oc, Dr. Dooom and Black Elvis, he has some insanely weird lyrics.
We stuck together when one of my parakeets died
You broke down and cried, for the love of animals
I used to always cut the legs off a roach
See if he'll stay there on a piece of tissue
And give him a piece of toast
That morning, he would wake up and be gone
What, the insect had a ambulance?"
I had never heard this particular Kool Keith rap before, but as soon as I read the lyrics I could hear his very distinctive delivery pattern. I listened to the track on YouTube and was not surprised that it was just as I had imagined. How distinctive.
" First patient, pull out the skull, remove the cancer
Breaking his back, chisel necks for the answer
Supersonic bionic robot voodoo power
Equator ex my chance to flex skills on Ampex
With power meters and heaters gauge anti-freeze
Octagon oxygen, aluminum intoxicants
More ways to blow blood cells in your face
React with four bombs and six fire missiles
Armed with seven rounds of space doo-doo pistols
You may not believe, living on the Earth planet
My skin is green and silver, warhead looking mean
Astronauts get played, tough like the ukelele
As I move in rockets, overriding, levels
Nothing's aware, same data, same system"
[...]
Hook x4
Earth People, New York and California
Earth People, I was born on Jupiter.
Gotta wonder about the garbage-in factor of Rap Genius. From one randomly selected Aesop Rock cut:
"Please I want to donate my brain to the monstrous Panasonic profit"
I guess it could be. I always heard it as "monstrous Panasonic prophet." It would be in keeping with the previous lyric "Television, all hail grand pixelated god of fantasy."
In that instance my guess is that he probably used a homonym on purpose. Perhaps another interesting breakdown would account (and give extra points) for homonyms since it's sort of 2 words.
This song has no credited transcriber, so it was one of the originals added to "bootstrap" the site, and hence is of much lower quality than the later songs added and cleaned up by users.
Profit makes more sense to me, because it ties into the directly preceding line "Please take me, please calm me, please make me a zombie". Aes is clearly talking about advertising and (broadly) media here, and while the double meaning of "prophet" does contribute to the line, the surface meaning is that he's donating his brain (attention) to the Panasonic profit (from advertising).
However, that said, that page (http://rapgenius.com/Aesop-rock-basic-cable-lyrics) does need some love. You should help out! Use the "suggestions" box on individual annotations to leave things for Editors (trusted members of the community, like myself) to integrate into the existing annotations.
While neither is really wrong, you've still got to choose one to put on the lyrics page. I believe I outlined my reasons for thinking one is the surface meaning and one is a double meaning.
This comment is nothing more than a cheap attempt to inject musical elitism. Please, if you have nothing critical to say that doesn't also inspire well-meaning debate, then take your business to Slashdot.
The estimate of vocabulary size here is based on the number of unique words used. This seems like it is strongly biased: if two artists have the same size vocabulary, but one has released more albums and thus used more words, that artist will probably have used more unique words. To underscore this point, the number of unique words used by Aesop Rock is half of the estimated vocabulary size of the average college student, although to be fair that estimate is the number of words that an individual can recognize, not the number of words they use. (Edit: the bias is somewhat mitigated by the fact that the same number of words is used to estimate the vocabulary for each artist, but the bias is not dependent on sample size alone but also upon the size of the artist's underlying vocabulary; see my comments below.)
The underlying problem is one of estimating the cardinality of a multinomial distribution given a fixed number of samples. In isolation this problem is ill-posed, since it is always possible that there is a word in a given lyricist's vocabulary that he uses with very low frequency and that is unlikely to appear in any sample, but with appropriate prior information it may be possible to obtain an accurate estimate.
This is not my field, but a brief Google Scholar search shows that there are several papers on estimating vocabulary size, or equivalently, estimating the number of species based on sampling. There is a somewhat dated review (http://cvcl.mit.edu/SUNSeminar/BungeFitzpatrick_1993.pdf) that details some methods of estimation (in this case, I believe we are in the domain of "infinite population, multinomial sample" with unequal class sizes). The paper notes that there is no unbiased estimator available without assumptions on the distribution of word use frequencies, but some of the proposed estimators may be more accurate than the naive estimate used here.
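To make the species-estimation idea concrete, here is a minimal sketch of Chao1, a well-known nonparametric estimator from that literature. It is swapped in purely as an illustration and is not necessarily one of the estimators in the review linked above. It uses the number of words seen exactly once (f1) and exactly twice (f2) to estimate how many words were never seen, and it is generally read as a lower bound rather than an unbiased estimate.

```python
from collections import Counter

def chao1(words):
    """Bias-corrected Chao1 estimate of vocabulary size from a token sample."""
    freq = Counter(words)
    s_obs = len(freq)                                  # distinct words observed
    f1 = sum(1 for c in freq.values() if c == 1)       # words seen exactly once
    f2 = sum(1 for c in freq.values() if c == 2)       # words seen exactly twice
    # The bias-corrected form avoids dividing by zero when f2 == 0.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
```

When every word in the sample appears many times, the estimate equals the observed count; many one-off words push it upward.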
This is why it's "number of unique words used within the artist's first 35,000 lyrics." Sample size is held constant. (Except, maybe, those who haven't yet written 35k words?)
Sample size is already controlled for. See the second paragraph, immediately before the graphic:
"I used each artist’s first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake."
Are you looking at a different link than us? The intro reads "I decided to compare this data point against the most famous artists in hip hop. I used each artist’s first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake." and the title of the infographic (if that's all you looked at) reads "# of Unique words used within artist’s first 35,000 lyrics"
Yes, I missed that, even though it is very clearly spelled out (oops!). It makes the ordinal comparison valid (modulo noise), but it does not completely address the concern. If you have two artists, and artist 1 uses 5,000 unique words in 35,000 lyrics while artist 2 uses 10,000 unique words in 35,000 lyrics, artist 2's vocabulary may be substantially more than twice as large as artist 1's. It is unlikely that a lyricist exhausts their entire vocabulary in such a small sample, particularly if their vocabulary is large and contains many words that they use infrequently. http://www.jstor.org.libproxy.mit.edu/stable/2284147 has a correction that can be applied, although even there the author notes that, when applied to James Joyce's Portrait of an Artist, their technique appears to greatly underestimate Joyce's total vocabulary.
This is a very good point - Aesop Rock, for example, uses one unique word every 5 words (7k unique in 35k), and if this does not stop, maybe he would continue and we would find the same average in 70k or 120k words. After all, you still have to have filler words like "to", "a", "the", "have", etc - he could be saturating the spots where he can put uniques.
So this could substantially underrepresent vocabularies. There are only so many unique words you can put in a sentence. As an extreme, if we looked at the first hundred words of every rapper, we would not find a hundred unique words in any of them (due to repeats of grammatically common words), even though, clearly, all rappers have a vocabulary of over a hundred words.
I wonder if this is a fatal flaw? How can we estimate where the distortion stops? (For example, if someone uses 1000 words in their first 35,000, intuitively this seems to imply to me that's most of their stock. But if someone uses 5,000 in 35,000 - that is not so clear at all.)
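One way to look at this empirically is a vocabulary growth curve: count the unique words after the first 5k, 10k, ... 35k tokens and see whether the curve is flattening. The sketch below assumes `words` is one artist's tokens in order; the function name and checkpoint values are illustrative only.

```python
def vocab_growth(words, checkpoints):
    """Return (n_tokens, unique_so_far) at each requested prefix length."""
    wanted = set(checkpoints)
    seen = set()
    curve = []
    for n, word in enumerate(words, start=1):
        seen.add(word)
        if n in wanted:
            curve.append((n, len(seen)))
    return curve
```

For example, `vocab_growth(words, range(5_000, 35_001, 5_000))` gives seven points. A curve that is still climbing steeply at 35,000 suggests the count is far from the artist's full stock; a flat one suggests it is close.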
The paper I linked to in my previous comment uses Zipf's law (briefly, the frequency of word use is inversely proportional to its rank; more at http://en.wikipedia.org/wiki/Zipf%27s_law) to estimate the "distortion." This should produce a better estimate than the naive method, but there are still problems: the plot on the Wikipedia page shows that Zipf's law is not a particularly good fit to word frequency for Wikipedia past the ~10,000th word, and it's not clear that rap music represents a typical natural language corpus. It is probably still possible to devise a correction if one knows how word use frequencies are distributed.
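A toy simulation can show the size of the effect. The sketch below assumes word use follows an idealized Zipf law (probability proportional to 1/rank) and that words are drawn independently, which real lyrics certainly violate; the vocabulary sizes are arbitrary. It is not the correction from the linked paper.

```python
import random

def simulate_unique(vocab_size, n_words=35_000, seed=0):
    """Draw n_words tokens from a Zipf-distributed vocabulary and count uniques."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    draws = rng.choices(range(vocab_size), weights=weights, k=n_words)
    return len(set(draws))

small = simulate_unique(5_000)    # artist with a 5,000-word vocabulary
large = simulate_unique(20_000)   # artist with a 20,000-word vocabulary
print(small, large, large / small)
```

Under these assumptions the second artist's true vocabulary is four times larger, but the ratio of observed unique counts comes out well below four, because rare words seldom appear in 35,000 draws.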
A second related problem that that paper touches on toward the end is that sequential words from the same text are not independent samples from an author's vocabulary. Two artists may have the same vocabulary, but if one artist uses more non-sequiturs, fewer articles, fewer repeated phrases, or generally tries to use more unique words within a given song, then that artist will come out ahead in the measure used here. I'm not sure how much of a problem this really is for comparing lyrics between artists (depending on what is of interest, it may actually be desirable), but it may explain the poor showings for Shakespeare and Melville, since prose is likely to repeat words more frequently than rap lyrics for reasons unrelated to the authors' vocabularies. (FWIW, even conservative estimates put Shakespeare's vocabulary at >15,000 words, which would be hard to measure in a sample of 35,000 words.)
Repetition also seems like a huge issue given a 35,000 lyric limit. If someone repeats the same line 5-7 times it's hardly reasonable to count that when estimating vocabulary.
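Repeated lines do not add unique words, but they do use up the 35,000-word budget. A simple way to test how much this matters would be to drop repeated lines within each song before counting. The sketch below is one possible approach, with a made-up function name, assuming `lines` holds the lines of a single song.

```python
def drop_repeated_lines(lines):
    """Keep only the first occurrence of each line, ignoring case and spacing.

    Blank lines are dropped as well.
    """
    seen = set()
    kept = []
    for line in lines:
        key = " ".join(line.lower().split())  # normalize case and whitespace
        if key and key not in seen:
            seen.add(key)
            kept.append(line)
    return kept
```

Running the unique-word count on the filtered lines and comparing rankings would show whether hooks and choruses change the picture.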
This is an issue that I was hoping would cancel out, given that I use the same analysis for every rapper. In short, it's an exact unique word count, but it's relationally accurate.
The problem you cite only exists if you explicitly want to estimate the underlying vocabulary of the writer. However, as a description of this particular corpus the vocabulary sizes are perfectly valid and exact rather than an estimate.
If we are really only interested in the number of unique words in the first 35,000 lyrics each of these artists has produced, and not in what they say about the artists themselves or how the number generalizes to the rest of their body of work, then yes, the analysis is exact and perfect. I don't think that's really the goal, though. We are interested in drawing inferences about the artists and their work. As I say above, the rankings are correct modulo noise (there is noise, since the numbers could differ between different 35,000-word samples from the same artist, and the count for the first 35,000 words could be affected by causes unrelated to those that we are trying to measure), but the magnitudes of the differences could be pretty far off.