Try out Stanford's CoreNLP natural language software (corenlp.run)
126 points by apsec112 on Oct 7, 2015 | 35 comments



CoreNLP is a very good baseline for any NLP work. If you can build something that beats it on a specific benchmark, it's a good bet you have something close to the state of the art.

But... there are problems. As software engineers, the (many) authors make great researchers.

CoreNLP is wonderful in the many different ways it almost lets you integrate with it without hacking the code. Changing the config of the various components is fantastic, because you get a very comprehensive view of the many different ways people can configure almost the same thing. Environment variables? System properties? Properties files in a specific location? In the classpath? JSON config? YAML? It almost supports them all - or rather, different parts use different ones, and the only way to work out exactly how it works is to read the code.

Also, the licensing is annoying. Everyone doing commercial stuff with it just puts it behind a web service anyway, so they should just embrace a non-viral license and get some input from the community.

Also SUTime. Yes, it works mostly, but wow :(

(Sorry for the rantish post. I use parts of CoreNLP a lot, and I'd love to see it improve.)


Have you used spacy.io for comparison? The author seems to enjoy trolling about the faults of CoreNLP, but the license is very agreeable and the results seem on par.


I've looked at it, and played with it a bit. I think when I last looked at it the licensing was even less friendly than CoreNLP's. I believe this has been fixed though, so it's probably worth looking at again.

From memory, when I was last looking around I cared mostly about named entity recognition (NER), and spaCy themselves say CoreNLP is better at it. CoreNLP has more features too.

Word vector integration in spaCy looks interesting though. That's probably enough to make me have a play with it.
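
A minimal sketch of that kind of quick play with spaCy's pretrained model (the model name and the spacy.load call reflect current spaCy and are assumptions; the 2015-era API differed):

  import spacy

  # Load a pretrained English pipeline that ships with word vectors.
  # "en_core_web_md" is an assumed model name; any vectors-bearing model works.
  nlp = spacy.load("en_core_web_md")

  doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

  # Entities found by the pretrained NER model
  for ent in doc.ents:
      print(ent.text, ent.label_)

  # Word-vector similarity between two tokens
  print(doc[0].similarity(doc[6]))  # "Apple" vs "U.K."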


I suggest Gensim for Word2Vec.

spaCy is really fast, the author is extremely knowledgeable, and it works well on the datasets it was trained on. The problem with spaCy for me was that it was pre-trained on those texts and it was not possible to train it on new things. Also, their parser wasn't very customizable.

This pre-trained model is borderline useless if you want to obtain good results on your data, which is probably very different from the data they trained on.

For NER in the Python world, the best option is pycrfsuite. It works really quickly and lets you easily define your own features. CRFSuite itself is a work of art. I only wish Okazaki had incorporated 2nd-order transitions, because that makes a huge difference on some datasets.

In my experience, CoreNLP is a lot worse than pycrfsuite in terms of ease of integration and performance, particularly if you want to define your own features.


Your Word2Vec comments are interesting. I've had pretty good success with the pretrained vectors the original Word2Vec implementation shipped with. That's a big dataset of course, but one of the strengths of the model is that if your training dataset is big enough you don't really need a lot of specialized training.

  CRFSuite

I've never used this, but it's not really a ready-to-use NLP toolkit is it? Isn't it more a tool for building NLP tools with?


Gensim's implementation is better than the original and allows Python API access to all the features. Highly recommended. Radim tends to write really memory-efficient code, unlike some other Python libs, so you can deal with large datasets.
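
A minimal sketch of training and querying word vectors with gensim (parameter names follow gensim 4.x; older versions used e.g. size instead of vector_size, and the GoogleNews file path is an assumption):

  from gensim.models import Word2Vec, KeyedVectors

  # Train on your own tokenized corpus: a list of token lists.
  sentences = [
      ["natural", "language", "processing", "is", "fun"],
      ["word", "embeddings", "capture", "distributional", "similarity"],
      # ... many more sentences for anything useful
  ]
  model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
  print(model.wv.most_similar("language", topn=5))

  # Or load the pretrained GoogleNews vectors from the original word2vec release.
  # vectors = KeyedVectors.load_word2vec_format(
  #     "GoogleNews-vectors-negative300.bin", binary=True)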

If you want to do NER in a way that doesn't suck there is no way around making your own model on your own training data.

It honestly takes only a few days of labeling things yourself. I found that outsourcing the work to Amazon Mechanical Turk is not a viable option because the graders there are terrible. And they work about 30x slower than you do. Even if you pay them $1/hr, that is like paying one person $30/hr. I'm not kidding.

Sure, you can do a quick and dirty "send data to these guys and they'll do all the work", but I haven't come across a model that works well on all datasets. We're talking going from 30-ish percent accuracy for a model not trained on your dataset to low 90s for something trained on your dataset. Of course, these are approximate numbers and it is definitely possible that your dataset is almost exactly like the ones they trained their model on.

It's incredibly simple to make your own model.

1. Label your data with brat: http://brat.nlplab.org/index.html # 5 days for 2k one page documents.

2. Tokenize data with nltk/spaCy. Come up with features and label using pycrfsuite (a minimal sketch follows after this list): http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blo... # 1 day

3. Do more labeling, retokenize, add neural embeddings from word2vec's similar words to the tokens you have, tune parameters or come up with better features such as your own dictionaries of entities, etc. Retrain the model. # 2 weeks

4. Done. Now you have a memory-efficient, fast model tuned on your data. You can label anything you want: not just Person/Company, but things like car vs bicycle brands, computer parts, obfuscated email addresses, etc.
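
A minimal sketch of step 2 with pycrfsuite, roughly following the tutorial linked above (the feature set and BIO labels are illustrative, not the parent's actual setup):

  import pycrfsuite

  def token_features(tokens, i):
      # Simple per-token features; extend with POS tags, gazetteers, word2vec clusters, etc.
      word = tokens[i]
      return {
          "word.lower": word.lower(),
          "word.istitle": word.istitle(),
          "word.isdigit": word.isdigit(),
          "suffix3": word[-3:],
          "prev.lower": tokens[i - 1].lower() if i > 0 else "BOS",
          "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "EOS",
      }

  # One training example: a tokenized sentence plus its BIO labels from brat.
  tokens = ["Jane", "Smith", "works", "at", "Acme", "Corp", "."]
  labels = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O"]

  trainer = pycrfsuite.Trainer(verbose=False)
  trainer.append([token_features(tokens, i) for i in range(len(tokens))], labels)
  trainer.set_params({"c1": 1.0, "c2": 1e-3, "max_iterations": 50})
  trainer.train("ner.crfsuite")

  tagger = pycrfsuite.Tagger()
  tagger.open("ner.crfsuite")
  print(tagger.tag([token_features(tokens, i) for i in range(len(tokens))]))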


I'd endorse pretty much all of this.

I want to make "domain adaptation as a service" the key part of spaCy's business model: you send us text, we send you a good model. Internally this will probably involve annotating part of the text, but that's a tactical decision we'll make.

I hope we can make some breakthroughs that help NER be much more general than it is currently. But the current solution you describe works fine; it's just a pain in the ass for each organization to take on. We want to have the required infrastructure and expertise set up, and make the process seamless.


  I think when I last looked at it the licensing was even less friendly than CoreNLP.

It has recently changed to MIT.


  Also SUTime. Yes, it works mostly, but wow :(
So what's better than SUTime? I'm using it now and am genuinely curious.


Nothing :(

But the code and config mechanisms are terrible to use and the documentation of them is even worse.

It's ok if you want to use it as it is out of the box. But try to do something like change to forward looking dates ("On Monday" should mean next Monday instead of last Monday) and it isn't as easy as it should be!


Ok, yes, you're absolutely right, thanks for responding. I have customized it a bit (e.g. changing "yesterday" to return a TIMEX3 Date instead of a date/time, roughly speaking, I'm probably forgetting the details). I was really hoping you knew of something better ;)


For what sorts of applications do you use NLP?


Well..

Originally I got interested because of an essay PG wrote about Bayesian spam filtering, so I wrote a Bayesian classifier in Java (this was over 10 years ago, when that was pretty cutting edge).

That led to text summarisation - still open source Java stuff, and apparently now considered state-of-the-art[1] (I'm kind of amazed, because that code is 10 years old).

Then I did some AdTech stuff, wrote an open domain natural language question answering thing (like Watson, but not as good - but it was just for fun. bAIb is the way to approach this now if anyone is interested).

Nowadays I'm using NLP for future event prediction.

[1] http://dl.acm.org/citation.cfm?id=2797081


I'm curious to know what you mean by bAIb?



I took CoreNLP for a test drive last year during a content-driven recommendation project. Unfortunately we didn't have an opportunity to use it on the project, but I was very impressed with what I saw. I was made aware of its existence after Stanford's online Sentiment Analysis demo (http://nlp.stanford.edu/sentiment/) was released.


Mmhh, I tried it with Spanish and it didn't work.

I see at least that it identifies most words as FW, "foreign word", perhaps?

I'm not very familiar with CoreNLP (or NLP, more generally); does it only support English out of the box? Or does it support nothing out of the box, and this demo is just configured for English?

Google Translate does a pretty good job of guessing the language of non-ridiculously-short-and-ambiguous sentences. Also, from what I know about how it works, it seems quite agnostic to specific languages. Does a less stochastic approach (which is what I assume this NLP library uses) provide such flexibility? Something akin to the library knowing all languages, and deciding in which of them a given sentence makes sense. I can't put an example on the table right now, but surely there are sentences that can work in more than one language, at least if you accept a couple of misspellings.


Any link/bridge to call python NLP libraries mentioned in this thread from Node.js?


Ha! NLP still does not get the 'time flies like an arrow' phrase. It still thinks 'time flies' is a compound noun, apparently some kind of time-travelling flies.

I remember reading about computers being confused by this phrase in the '80s and '90s. Apparently, not much progress here.


No, it gets "time flies like an arrow", but it can't handle the garden path in "time flies like an arrow, fruit flies like a banana."

Which humans can't do in only one pass either.


http://spacy.io/displacy/?full=time%20flies%20like%20an%20ar...

You can download the spaCy library for yourself if you suspect I've cooked the example :)
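
If you'd rather check locally than trust the link, a quick sketch with spaCy (the model name is an assumption; any English model will do):

  import spacy

  nlp = spacy.load("en_core_web_sm")
  doc = nlp("Time flies like an arrow.")

  # Print each token's POS tag, dependency label, and head word.
  for tok in doc:
      print(tok.text, tok.pos_, tok.dep_, "->", tok.head.text)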


Unfortunately it fails with the suggestion provided by wodenokoto below:

http://spacy.io/displacy/?full=fruit%20flies%20like%20a%20ba...


Pretty unfortunate that the default demo sentence gets parsed wrong.


No one was supposed to see this site :(. Our official demo server is http://nlp.stanford.edu:8080/corenlp/process. We'll change the demo sentence though.


Looks good to me?

What is it getting wrong?


The dependency parse in the demo for "my dog also likes eating sausage" has "eating" as an adjective modifying "sausage". It's as if "eating sausage" were a kind of sausage.

The correct parse has "eating" as an xcomp of "likes", and "sausage" as a dobj of "eating". You can see the correct parse for this structure if you plug "I like eating sausage" into the demo.

EDIT: Weirdly, the demo of the same software at http://nlp.stanford.edu:8080/parser/ doesn't make this mistake!


Yes, I missed that. Good pickup.

  Weirdly, the demo of the same software at http://nlp.stanford.edu:8080/parser/ doesn't make this mistake!

See my rant about the config of CoreNLP: https://news.ycombinator.com/item?id=10350090


These are likely using different parsers. Namely, the (faster) neural net dependency parser running at corenlp.run -- which gets thrown off by the POS tag error -- versus the constituency parser you linked to.

You're right that the configuration is confusing for new users. But you have to remember that this is first and foremost research code, intended to be flexible enough for the Stanford NLP group's research needs. In terms of particular configuration sources, nearly everything should be configurable from properties passed in as a properties file. Are there exceptions to this rule?
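
For example, here is a rough sketch of passing those properties to a running CoreNLP server from Python (the port, endpoint, and requests-based call are assumptions based on the CoreNLP server documentation):

  import json
  import requests

  # Annotator configuration expressed as properties.
  props = {"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}

  resp = requests.post(
      "http://localhost:9000/",  # a locally running CoreNLP server
      params={"properties": json.dumps(props)},
      data="Marie flew to Buffalo last Tuesday.".encode("utf-8"),
  )
  for sentence in resp.json()["sentences"]:
      for tok in sentence["tokens"]:
          print(tok["word"], tok["pos"], tok.get("ner"))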


  In terms of particular configuration sources, nearly everything should be configurable from properties passed in as a properties file. Are there exceptions to this rule?

SUTime is documented to accept a property pointing to a rules file that it then reads. Quote:

sutime.rules = [path to rules file][1]

I'm unclear if that works - looking at the code, it appears the rules files need to be on the classpath, under a package that is defined by the sutime.rules property.

I don't remember how other packages worked.

[1] http://nlp.stanford.edu/software/sutime.shtml


This should be able to read rules from the filesystem as well (if not, it's a bug and we should fix it!). The reason all of the examples are classpath examples is just because we distribute our models as a jar by default. The SUTime rules could be thought of more as a "model" for the rule-based system rather than a configuration file.


Hmmm. I tried "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo."

Didn't get it right.

(See https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal... )

Did better on "Colorless green ideas sleep furiously."


But that Buffalo sentence is something on which people also stumble. A machine can be contrived to pass such a test case, while remaining a lousy parser compared to people.

The real problem is that the parser comes so swiftly to a wrong conclusion and cheerfully presents it as a valid result.

It would look a lot better if it simply reported "error: cannot parse that". (Better yet, with reasons: "I cannot parse that because I get stuck on this specific ambiguity and it's just too much for me.").

Also, what about the possibility of multiple results? Language is ambiguous. If something has two parses, it's wrong to assert just one.

This thing has given no consideration whatsoever to the possibility that even a single instance of "buffalo" in the sentence might be a verb, which flies in the face of almost any noun in English being verbable.


But it won't ever have trouble, because it's not trying to understand the sentence. It will tell you the most probable parse of that sentence based on its model, whether or not it makes sense to a human.


People who manage to parse the sentence also aren't trying to understand it, except as far as "Buffalo" is a proper noun denoting a city, which can be used to form the phrase "Buffalo buffalo" == buffalo of/from/belonging to/related to Buffalo. They are trying various combinations of interpreting "buffalo" as a noun (in various roles as subject, direct object and so on) or verb, and determining elided words such as "which" or "that" complementizers heading off phrases and embedded clauses.

It's almost purely syntactic reasoning. Searching these spaces of possibilities is something which, you would think, a "natural language parser" ought to be doing to earn its name.

Nobody actually knows what it means to "buffalo" something; it is not necessary to know. People solve the parse in spite of knowing that there is nothing to understand in the sentence.


"buffalo" can mean something like "bother" as an English verb (at least in American informal use), so the whole sentence as parsed in English does have a concrete mental image associated with it, in case that makes any difference.



