Stanford researchers to open source model they say has nailed sentiment analysis (gigaom.com)
241 points by suprgeek on Oct 3, 2013 | 66 comments



Socher thinks his model could reach upward of 95 percent accuracy, but it will never be completely perfect.

Ridiculous accuracy for something as complex as sentiment analysis. You don't hear established researchers say something like this often. Moving any number from 85% to 95% is the work of the gods.

I wonder if the code they release will include some version of the data from the Mechanical Turk project. Code for this is great and many people (myself included) will be able to learn a lot from it. But it won't have the same level of reproducibility without the data.

If they do release it they will effectively be giving away the money they spent on mechanical turk. 11,000 HITS ain't cheap and they probably had redundant sampling. If they decide to make this data public as well it would be a big win for research because labelled data is so important to machine learning work.

Open sourcing the code associated with a research paper is already a huge deal. It's great to see big name researchers like Andrew Ng pushing the trend for publishing code. If nothing else this is a great example for computer science papers going forward.


> I wonder if the code they release will include some version of the data from the Mechanical Turk project.

I don't know if the "raw" data from MTurk is included in the data set. But at least the finalised data has been released for quite some time now.

http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.z...

As for the value of the data, I have heard numbers around 10,000 USD. But I have little more than academic gossip to back these numbers up.


It is great to see more crowdsourcing data getting open sourced.

I haven't read the research but I'm assuming he used a fairly unrestricted crowd. I'd be interested to use the crowd to rate sentiment analysis results from this model vs. CrowdFlower's Senti to see how the model fares against a specialized sentiment analysis crowd. I would be very surprised if the model won. I am willing to run a comparison if there is sufficient interest in the data.


Yes, but a little realism here: They have actually moved it from 80% (using single word analysis) to 85%. That's fantastic, but not the work of the gods.


I agree that the article is sensational, and what really confuses me is that when I read the paper and their press release (linked below) they are far more humble and down to earth (I really like the writing and the work in general). My personal guess is that this is a classic case of a journalist going overboard (been there...).

http://engineering.stanford.edu/news/stanford-algorithm-anal...


I dunno..

Sentiment analysis is often akin to mind reading. I don't know how the average human would do, but given how often people miss jokes & sarcasm on forums & emails it wouldn't surprise me at all to learn that the average human analysis would score below 85% on single sentences without context.

(And to be fair to the OP, they said 95% would be the work of gods, and I'd certainly agree that would be better than most humans could do)


In general, after the 'easy' gains on a decently researched machine learning problem have been made, a new approach that gets, say, a 2% increase (from 91% to 93%, say) is considered a significant breakthrough and happens rarely - maybe once every 5-10 years. 80%->85% is great (the error rate drops from 20% to 15%, a 1.33x reduction). 80%->95% could be called the work of the gods, given the 4x reduction in error rate.


Never used Mechanical Turk, but 11,000 HITs at 2 to 3 cents each (a guess) doesn't seem that pricey. It's the cost of a couple of textbooks. :)


Sentiment analysis is something the Turkers can do quickly and efficiently, so those tasks tend to go for lower rates because you can recruit such a large worker pool for them. Same for image classification.


Good sentiment or survey results generally run more like $0.75-$2 per HIT.
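
(Back-of-the-envelope, using the numbers upthread: 11,000 HITs at $0.02-$0.03 each is only about $220-$330, but at $0.75-$2 each it comes to roughly $8,250-$22,000, which is much closer to the ~$10,000 figure mentioned above.)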


It's been around for a long time; I don't understand why he's glorifying what Google achieved ages ago. Honestly, I don't find anything new in the technique.


Here is a link to the live demo: http://nlp.stanford.edu:8080/sentiment/rntnDemo.html

This is really fun to play with, and I'm surprised how well it can parse the sentiment of sample sentences I threw at it. I've tried a couple random examples (like "I don't know what the artist was smoking, but the song made no sense (though I liked the beat!)") and have not yet gotten a wrong analysis. Even the phrase parsing is pretty spot-on.

As a side note, this is much more interesting than the "sediment" analysis I expected after skimming the title. (Unfortunately though, the analyzer got this final sentence wrong: http://cl.ly/image/301u1q46263m)

Edit: seems like this system could get significantly more robust with more data. If you look in the comments section, you can see some comments from the professor himself, i.e. "Possibly because the word "buying", only appears once in the entire dataset and it's in a pretty negative context: http://nlp.stanford.edu/sentiment/treebank.html?w=buying"

If you gave it 100,000 phrases, I wouldn't be surprised if it could hit the 95% mark that Socher mentions.


To really put it through its paces you need to feed it reviews of The Room[1] from imdb:

>It is, without question, the worst film ever made. (--) But this comment is in no way meant to be discouraging. (-) Because while The Room is the worst movie ever made it is also the greatest way to spend a blisteringly fast 100 minutes in the dark. (--) Simply put, 'The Room' will change your life. (0)

[1] http://www.imdb.com/title/tt0368226/


I tried to fool it with, "My excitement for the movie, as well as the director's IQ, was room temperature... in Celsius." and while the individual words were either positive, neutral, or unknown, the overall sentence was negative. I'm impressed!


The sentence: "It is refreshingly unoriginal and an utter failed masterpiece at best"

is parsed as being positive ... I guess the sarcasm detector still needs some work :)


I fed it with "I'm surprised how well it can parse the sentiment of sample sentences I threw at it." and it thought that was negative - so I corrected it. The whole level of interaction and transparency with the model is great. "Buy" occurs 12 times. I wonder if stemming could help.


> I'm surprised how well it can parse the sentiment

I tried to fool it, but it took quite a bit of effort to contrive a sentence that it got wrong.

The sentence I finally fooled it with:

--> I enjoy clipping my toenails to paying any more attention to this movie.

which it rated as highly positive.


I could fool the algorithm with: "Only the book cover was good."


It does seem to struggle here.

>I'd enjoy clipping my toenails more than paying any more attention to this movie. --> positive

>I'd enjoy clipping my toenails over paying any more attention to this movie. --> negative

If you change "paying" to "giving" it classes that clause as neutral instead of negative, but doesn't change the result.


Your sentence is malformed. It should be "I'd enjoy clipping my toenails to paying any more attention to this movie."

Still, I'm not even sure if using the preposition "to" in that manner is proper English.


It isn't proper English. A proper version of this sentence could be "I'd prefer clipping my toenails to paying any more attention to this movie" but even then the "paying any more attention" part would probably cause a human sentiment analyser to do a double take at it.


The software above correctly rates this as negative.


Also read as Positive: "It is refreshingly boring and is a monument to the achievement of utter platitudes"


"It thinks it's just so smart."


It failed on "Are you sure that's a good idea?"


I read the paper when this first showed up on HN[1]. The most important thing they did was to create a training set with higher granularity in the data than much of anything previously seen. Based on their training set, their algorithm was able to achieve 85% positive/negative accuracy on sentences, but previously state-of-the-art algorithms moved from 80% accuracy up to 83% accuracy when adapted to their training set. While their algorithm appears to be better than anything they tested against, this is fundamentally an incremental improvement, not groundbreaking research. The real win here came from using a better dataset.

[1] http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf

Edit: formatting


Will be interesting to play around with that dataset.


I wouldn't quite say they "nailed" it.

One issue is that a system like this needs to be trained for the specific kinds of documents you are processing. For instance, if you are looking at people's opinions on stocks, there is specific terminology to look for such as "buy", "sell", "short" or "long", "missed earnings", price targets, etc.

This isn't so much a problem with their method, but it is a problem w/ the specific model they are publishing.

I like that they are using "beyond bag of words" methods and I find it very believable they could get much better results if they had a bigger training set and more effort in tuning.

One advantage us commercial folks have is that we don't need to bet on every hand. Reviews like that one of "The Room" are ambiguous at best and should be filed as such.


First, let me say that this is really creative work and I'm glad it's being presented at EMNLP.

"Sentiment analysis" is too broad of a category to really cover in a single article like this. What they've done is taken a very difficult problem, sentence-level binary sentiment, and made solid progress on it. The baseline for this dataset using totally naive techniques is around 75%, and their results are the state of the art.

The move from 85% to 95% isn't really an interesting one. What really matters is exploring the numerous other open questions in the field of affect recognition, notably two things:

* Sentiment at different granularities. Document level analysis has been far above 90% for years; this work is pushing forward sentence level. Other work is making great progress on targeted opinions even finer-grained than that, like looking at specific attributes of products. What if you like a movie's acting but not its plot? This structured nuance is not addressed here.

* Domain adaptation. You talk about movies in a different way from almost anything else. A movie review is positive if it's unpredictable; your opinion of the unpredictability of dishwashers or political candidates is probably different. For anything beyond movie reviews this method may work, but this particular dataset certainly won't.

Looking forward to seeing more from this group, as ever; Chris Manning's research team has an excellent reputation in the field.


What makes the Sentiment Treebank so novel is that the team split those nearly 11,000 sentences into more than 215,000 individual phrases and then used human workers via Amazon Mechanical Turk to classify each phrase on a scale from “very negative” to “very positive.”

Can someone here please explain whether the use of Mechanical Turk here is a cop-out from building a better computational model, or just an ordinary use of supervised learning in place of unsupervised?


This is just an ordinary use of supervised learning; they just used Mechanical Turk instead of undergrads. The point of the human classification was to generate the labeled data for model training.

The talk they gave, "Deep learning for NLP (without magic)" was pretty good: http://techtalks.tv/events/312/573/


It's neither, I think. It's basically a way to get training data. The novel part is splitting sentences into phrases and asking about the sentiment of each part, which provides the algorithm richer data to train on.
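
Concretely, the labelling unit looks something like this (an illustrative sketch with an NLTK-style constituency tree, not the actual Stanford pipeline): every constituent of the parse, right down to the individual words, gets its own sentiment label.

    # Hypothetical example: enumerate every phrase of a parsed sentence;
    # in the treebank, each of these spans was rated by Turkers.
    from nltk import Tree

    parse = Tree.fromstring(
        "(S (NP (DT The) (NN movie)) (VP (VBZ is) (RB not) (JJ bad)))")
    for constituent in parse.subtrees():
        print(" ".join(constituent.leaves()))
    # -> "The movie is not bad", "The movie", "The", "movie",
    #    "is not bad", "is", "not", "bad"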


Seems straightforward:

- certain combinations of words within phrases score all over the place

- hand those to mechanical turk for human classification

- understand where the results differ from the model

- patch the model where necessary when it breaks down.

The example they gave with the "but..." at the apex of the sentence is difficult primarily because it's ambiguous as to what precedes it. It could be positive or could be negative, especially from a programmatic standpoint.

Really fascinating stuff. Can't wait to see the code.


No, wait. So as someone who isn't remotely well-educated on machine learning, what teaching model would this be called? "Reinforcement learning" is an AI term, but it's more what this sounds like: sometimes the model is wrong, so we hand the data off to a human being who comes up with a definitive Right Answer which is then used to fix the model.


I believe this is strictly supervised learning. Mechanical Turk was only used for the initial labeling of the data.


What "landongn" describes, where certain data 'score all over the place' - i.e. where the model tells you it is uncertain about some phrases, and where those phrases are subsequently manually annotated by a human, sounds like 'Active Learning'. [0]

If the model is telling us that it is uncertain about specific examples, and would like more information on examples like those, that's active learning.

That sounds different from what you describe in your post, depending on what you mean by 'sometimes the model is wrong, so we hand the data off to a human being'.

It depends on how we know the model is wrong.

If we know it's wrong on a test datum that is part of a big set of test data humans labelled without any input from the model, then it's standard 'supervised learning'.

If, instead, the model is 'wrong' because it expresses uncertainty about particular test data, and we then have a human classify the data it was uncertain about and retrain the model, we are probably doing Active Learning - something like the sketch below. In this case, the model/system is (at least partly) guiding the learning process.
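
A single round of that loop might look something like this (my own rough sketch with scikit-learn, not anything from the paper; `ask_human` is a made-up callback standing in for the annotator):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def active_learning_round(model, X_train, y_train, X_pool, ask_human, k=10):
        # Train on what we have, then find the k pool examples the model
        # is least confident about and send just those to a human.
        model.fit(X_train, y_train)
        probs = model.predict_proba(X_pool)
        uncertainty = 1.0 - probs.max(axis=1)    # low top probability = unsure
        ask = np.argsort(uncertainty)[-k:]
        new_y = np.array([ask_human(x) for x in X_pool[ask]])
        X_train = np.vstack([X_train, X_pool[ask]])
        y_train = np.concatenate([y_train, new_y])
        X_pool = np.delete(X_pool, ask, axis=0)  # drop the newly labelled rows
        return model, X_train, y_train, X_pool

    # e.g. repeat for several rounds with model = LogisticRegression(max_iter=1000)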

Reinforcement learning is neither of these things exactly - it describes a more general framework, where the system is rewarded based on how well it's performing.

Let's say you want to choose 1 of 5 labels for each datum. In supervised learning, the system is given the right label for each training example. In an RL setup, it might be shown an example, have to guess a label, and perhaps be told whether its guess was right - but if it guessed wrong, it is not necessarily told what the right answer was.

There's a little fuzziness to how all these terms are used in practice.

[0] http://en.wikipedia.org/wiki/Active_learning_(machine_learni...


Yeah, and this is why I desperately need to take a machine learning class. Why oh why did I finish undergrad in seven semesters instead of eight?


The Caltech course Learning from Data started on Monday. The professor has excellent reviews: https://news.ycombinator.com/item?id=6385602

So far I think they're accurate. Homework 1's due in a few days, but the lowest 2 homeworks are dropped.


Before signing up for an online course I'm pretty set on starting my new semester at Technion and figuring out what my Real Life We Will Count This Against You workload is.


While 95% accuracy would be a really phenomenal achievement, an accuracy in the range of 85-90% is achievable using methods simpler than deep neural nets. I have done some work on sentiment analysis in the past. I used a Naive Bayes model with some enhancements like n-grams, negation handling and information filtering and was able to get more than 88% accuracy on a similar dataset based on movie reviews.

You can find more details here - http://arxiv.org/ftp/arxiv/papers/1305/1305.6143.pdf - and the code over here - https://github.com/vivekn/sentiment/blob/master/info.py
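
For anyone curious what that kind of model looks like, here's a stripped-down toy sketch of my own (not the linked code): multinomial Naive Bayes over unigrams and bigrams, with a crude NOT_ transform for negation.

    import re
    from collections import Counter, defaultdict
    from math import log

    NEGATIONS = {"not", "no", "never"}

    def features(text):
        tokens = re.findall(r"[a-z']+", text.lower())
        out, negate = [], False
        for tok in tokens:
            out.append("NOT_" + tok if negate else tok)
            if tok in NEGATIONS or tok.endswith("n't"):
                negate = True   # crude scope: everything after a negation flips
        return out + [a + " " + b for a, b in zip(out, out[1:])]  # add bigrams

    class NaiveBayes:
        def fit(self, docs, labels):
            self.counts = defaultdict(Counter)
            self.priors = Counter(labels)
            for doc, y in zip(docs, labels):
                self.counts[y].update(features(doc))
            self.vocab = {f for c in self.counts.values() for f in c}
            return self

        def predict(self, doc):
            def logprob(y):
                # Laplace-smoothed log-likelihoods plus (unnormalised) class prior
                total = sum(self.counts[y].values())
                s = log(self.priors[y])
                for f in features(doc):
                    s += log((self.counts[y][f] + 1) / (total + len(self.vocab)))
                return s
            return max(self.priors, key=logprob)

    # usage: NaiveBayes().fit(train_docs, train_labels).predict("not bad at all")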


Just tried with this great phrase from the late Roger Ebert (slightly modified to fit in one sentence; the original is four different sentences):

> The movie has been signed by Michael Bay: this is the same man who directed "The Rock" in 1996; now he has made "Transformers: Revenge of the Fallen", and, well, Faust made a better deal.

It correctly identifies the sentence as negative, while all words taken individually are either neutral or positive... I'm impressed.


This is HUGE. So many companies are trying to use sentiment analysis as their marketing tool for how they parse social media. With an open source tool, it would be easier for regular developers who may not know much about NLP to tap into that part of the industry.

As I'm reading through the article I see that it says the algorithm can understand "Human Language." By this I'm guessing they mean English. One thing I learned about sentiment analysis is that analyzing other languages may prove to be a bit more difficult.

Another thing I'd like to do is run it up against a very basic sentiment analysis engine my old manager built, which basically had 13 positive words and 13 negative words and was also about 80% accurate: no neural networks, AI, or machine learning needed.


The model was trained on an English data set, but if you train the same model on some other data, it can handle other languages as well.


Okay, I'm asking because I know that other languages have nuances that English doesn't. I didn't realize that the algorithm could perform at such a high accuracy while still being language agnostic.


The simple phrase "Not bad." results in a negative sentiment. This should be at least neutral, if not slightly positive. Interestingly, omitting the period gives a neutral result.


I would think this would be one of the easiest to get right... bad=negative Not=negative ... two negatives=positive


Hmm, they should have handled negations; it's not that hard to implement.
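
Something like this is what I mean - a toy sketch of my own (obviously nothing like the Stanford model), flipping the polarity of lexicon words that come after a negation:

    NEGATIONS = {"not", "no", "never"}
    POSITIVE = {"good", "great", "fun"}    # toy lexicon, made up for the example
    NEGATIVE = {"bad", "boring", "awful"}

    def score(sentence):
        total, negated = 0, False
        for word in sentence.lower().replace(".", " ").replace("!", " ").split():
            if word in NEGATIONS:
                negated = True             # crude scope: flip everything after it
                continue
            polarity = (word in POSITIVE) - (word in NEGATIVE)
            total += -polarity if negated else polarity
        return total

    print(score("Not bad."))   # 1, i.e. leans positive instead of negative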


Over time and with more sample sentences, Socher thinks his model could reach upward of 95 percent accuracy

It would be interesting to read the paper to find out what accuracy really means here. I doubt that human readers agree on the sentiment of movie reviews 95% of the time.


Stanford Ph.D. student Richard Socher appreciates the work Google and others are doing to build neural networks that can understand human language. He just thinks his work is more useful ...

"We’re actually able to put whole sentences and longer phrases into vector spaces without ignoring the order of the words."

Wait, didn't Mikolov et al. (Google) [just figure out][1] how to put entire languages into vector spaces?

[1]: http://arxiv.org/abs/1309.4168


As someone who has done some work in the sentiment analysis field, I present this comment as the perfect example of why sentiment analysis is easy and the linked research is clearly bunk.


Just having the state of the art as open source is in itself fantastic. The fact that their approach is a considerable improvement over the previous approaches is icing on the cake.


Actually, for pretty much any NLP problem the state of the art is open source. The methods often aren't packaged as convenient libraries, but the actual best-in-field methods usually have both detailed algorithm descriptions in the published papers (from which we can, and sometimes do, write a direct reimplementation) and a reference implementation with available source, used to get the measurements proving it really is state of the art. Sure, those research implementations tend to be 'not-production' level of polish, often requiring some pain to install and to convert your particular data; but they are available. In a few cases the best known method is a commercial implementation, but then usually the #2 implementation is almost as good and that one is available.


It got mine wrong...

"That makes about as much sense as a whale and a dolphin getting it on."

Keep working on it guys... I wish I understood sentiment trees well enough to be able to train it properly for this statement... Is a sentiment tree able to properly represent sarcasm and innuendo? <--- Honest question


Highly idiomatic phrases are always going to be problems for natural language processing. Further, people aren't 100% accurate in language processing either.

I'd say that if you posed that statement to 1000 English speakers from around the world, at least 1% of them would be baffled by it.

All that is to say non-conventional uses of language are always going to be a problem for natural language processing. If a certain kind of innuendo and sarcasm is represented often in its training data, then the model SHOULD be able to understand it when it sees it again.


I wonder what the accuracy for native English speakers is in doing ternary sentiment analysis.

I also wonder about sentences which could be understood and defended as being positive to one human reader and negative to another.

"That is the craziest thing I've ever heard." or simply "That is sick."


Wow, this will open up more possibilities for a lot of companies I know of (and some projects of mine too :) ).

Off the top of my head, I know of a company that's trying to tackle online complaints (VozDirecta.com), another that feeds "what they're saying about your company"...


I feel like Borat for doing this, but I entered :

"I loved this movie.. NOT!"

and it classified it as positive. :)


I did the same!

It also doesn't seem to like swear words and web abbreviations (j/k, for example). Perhaps with more data (from Twitter or similar) it could go a long way though, as it definitely needs to learn about the more uncouth words.


Thwarted expectations and sarcasm are difficult to handle. You are down to the choice of optimizing for this particular case or the majority of text.


You've found the 15% it gets wrong. Though, given the other tricky cases it does well on, more training data might help.


Borat is not a measure by which any sentiment analysis should be done... :P


Are there any links to the code?


Not yet, they said it would be released in late October. Paper is here: http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf


I'm taking an NLP class this semester and it's nice to finally be able to dig into material like this rather than giving it a light read-through. Can't wait until the code drops!


Very impressive! Not to detract from how impressed I am, but I did manage to trick it once: "It could be better" was positive / very positive.


I'll be interested to read Language Log's (namely Mark Liberman's) opinion of this once it gets released.



