Advanced NLP with SpaCy

ZeroCool2u · on April 18, 2019

I can't recommend SpaCy enough. We use their Prodigy[1] app here at work, it's outstanding.

After experimenting with gensim, nltk, and most everything else under the sun, we primarily rely on SpaCy now with some TensorFlow for specific models.

1. https://prodi.gy/

syllogism · on April 18, 2019

Glad to hear it! Actually we noticed the order come through for your group and felt it was a shame we didn't have a proper contact email. We're especially pleased to support public institutions like the FRB. I also did a little bit of consulting for the Bank of Canada. Could you email me? matt@explosion.ai

ZeroCool2u · on April 18, 2019

Haha, hi Matt! Sure, just messaged you.

growlist · on April 18, 2019

Mordecai uses SpaCy and is worth a look for extracting place names: https://github.com/openeventdata/mordecai

I wasn't too successful running it against tweets (low hit rate/false positives, low spatial resolution) but geolocating tweets is a hard problem and I'm sure it would work better against more structured text.

rpedela · on April 18, 2019

Are you using the pre-trained NER models or your own? If the former, I wouldn't expect it to work well on tweets since it wasn't trained on them.

amrrs · on April 18, 2019

Do you have any recommendations for building a custom language model for business-specific NER?

rpedela · on April 18, 2019

I recommend Prodigy to label your examples and train a Spacy model. Prodigy is the best tool I have ever used for NLP labeling. Most likely starting with a blank model will work better, but you can try starting with one of Spacy's pre-trained models.

https://prodi.gy/

timkpaine · on April 18, 2019

I also recommend looking at ipyannotate if most of your workflow is in Jupyter.

mwexler · on April 18, 2019

MIT License for spaCy, though you'd never know it from the home page of the site. It is really frustrating to have such a nice tool with such a beautiful site, but have the site have no mention of the word "License" on it. Yes, it's in the code, but if one goes to the trouble to make such a great site, well...

danso · on April 18, 2019

spaCy's license has evolved over time. I know when its creator was discussing it on HN a few years ago, he had it under AGPL [0]:

> This was my understanding --- actually I designed the licensing structure of this project around the assumption that companies would not want to use GPL licensed code commercially. I offer an AGPL license, and offer a commercial license for a fixed fee.

A couple years ago or so it went MIT, I think he might have mentioned it on HN at the time.

[0] https://news.ycombinator.com/item?id=8944705

ausjke · on April 18, 2019

Sorry, but why is the MIT license bad here?

cwyers · on April 18, 2019

I don't think parent is complaining about the MIT license, they are complaining about how hard it is to find out what license is being used.

baconerrday · on April 18, 2019

I think he's just saying that it's a shame, because if people knew it was MIT licensed they would be more likely to use it.

acconrad · on April 18, 2019

I'm using NLP with Spacy for my slack bot and it's awesome! Glad to see they're offering a more advanced course to use their library.

avmich · on April 20, 2019

Does it handle dialogs?

_fx6v · on April 18, 2019

Curious how people are using SpaCy in production with constant improvement of the model and how that workflow happens from user input & review.

topicseed · on April 18, 2019

spaCy's updated Pattern Matcher is pretty amazing and we use it extensively in our textual content analyses to help SEOs.

wodenokoto · on April 18, 2019

How good is spaCy for exploratory analysis of a corpus?

I'm thinking, questions like, top adjectives applied to men and women, or top word usage across document classes.

The best I've come across for doing this easily is in R, where they have tidytext[1], which is nice and very straightforward to understand and work with. However, the data model stores each token and all its metadata (page number, sentence number, document id / title, document class, word type, etc.) in its own row, causing the in-memory size of whatever corpus you are working on to explode.

[1] https://www.tidytextmining.com
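
A rough sketch of the second kind of question (top word usage across document classes), in plain Python rather than spaCy or tidytext. Everything here is an assumption made for illustration: the toy corpus, the regex tokenizer, the tiny stopword set, and the helper name `top_words_by_class`. It keeps one running counter per class instead of one row per token, so memory grows with vocabulary size rather than corpus length. The adjectives question would additionally need a part-of-speech tagger, which this sketch does not attempt.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus: (document_class, text) pairs.
DOCS = [
    ("sports", "The team won the match. The crowd loved the match."),
    ("sports", "A great match for the home team."),
    ("finance", "The bank raised rates. Markets expect the bank to pause."),
]

# Crude tokenizer and stopword list, stand-ins for a real NLP pipeline.
TOKEN_RE = re.compile(r"[a-z']+")
STOPWORDS = {"the", "a", "for", "to"}


def top_words_by_class(docs, n=3):
    """Return {class: [(word, count), ...]} using one Counter per class."""
    counts = defaultdict(Counter)
    for doc_class, text in docs:
        # Stream tokens straight into the counter; no per-token rows are kept.
        tokens = (t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS)
        counts[doc_class].update(tokens)
    return {c: counter.most_common(n) for c, counter in counts.items()}


if __name__ == "__main__":
    for doc_class, words in top_words_by_class(DOCS).items():
        print(doc_class, words)
```

Swapping the regex for a real tokenizer should only change the line that builds `tokens`; the per-class counting stays the same.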

evrydayhustling · on April 18, 2019

spaCy in production for some grammar-based tokenization (frame.ai)! And their model for span annotations is awesome for our research pipeline; very convenient to be able to traverse pipeline objects in both directions to explore results.

anentropic · on April 18, 2019

This is very nicely presented. Kudos!

syllogism · on April 18, 2019

GitHub repo: https://github.com/ines/spacy-course

Might also be interesting to others who have DataCamp courses they'd like to release free.

charlescearl · on April 18, 2019

Many thanks to Ines for releasing not only the tutorial but the framework! Sharing and ethics have always been hallmarks of the spaCy team.

kyllo · on April 18, 2019

> So should I not take your DataCamp course anymore? Probably not, no.

Here's some context for anyone who's wondering why: https://noamross.github.io/datacamp-sexual-assault/

didymospl · on April 18, 2019

Thank you for the clarification. I don't want to speculate further about the issue, but as an annual DataCamp subscriber I am wondering what effect it could have on the service. I guess it means a sharp drop in new courses, but what about the existing ones? Could the authors remove them from the platform?

kyllo · on April 18, 2019

That question is also addressed in the blog post I linked to:

"In general, no. Contract terms vary a bit, but for me: I retain IP and the right to post it elsewhere, but grant DC a perpetual, non-exclusive license to the course material in return for royalties. I've requested breaking the contract by mutual consent, have not heard back."

https://twitter.com/noamross/status/1117050638955892742

didymospl · on April 18, 2019

Thanks again, I saw this at work and didn't read through the updates. I really like the convenience of the DataCamp platform and their "small-bite" approach to learning, and I would be sad to see most of the external courses removed (if they broke the contract), but I also understand and applaud all instructors who want to show their disagreement that way.

tasubotadas · on April 18, 2019

Where did the good old way of calling them pieces of shit and just leaving the job go?

Apparently, now it is also necessary to get someone fired and/or close the business entirely.

zimpenfish · on April 18, 2019

> Where did the good old way of calling them pieces of shit and just leaving the job go?

Why should someone have to leave their job because an executive is a "piece of shit"?

jpdus · on April 18, 2019

Unpopular opinion here:

I followed the outrage about Datacamp loosely on Twitter. At first I thought it was about an actual assault, but so far afaik the only thing publicly known is "uninvited contact", on a dancefloor and the victim reported that several months later.

Sexual assault or abuse is a crime and should be punished and prosecuted. However, if we treat every situation where two people interact physically and have a different understanding of their relationship as sexual assault, there won't be any sexuality without a written consent form in the future, and real victims get marginalized at the same time. I don't know if that is really what we should be aiming for.

I don't see DataCamp's reaction as inappropriate in this case, at least if there is nothing more to the story behind the curtain. Humans make mistakes, and in most cases everybody should get a second chance. One moment of misjudgment can be enough today to get fired; your name gets burned forever on the Internet and your life can get turned upside down. And I don't know if that helps the victim in any way.

Probably I will be downvoted for that, but I want to make clear again that I don't tolerate this behaviour in any way and I think the industry needs to be more inclusive and diverse. I just think that you have to differentiate and that public shaming is not in all cases a good solution.

madenine · on April 18, 2019

I don’t see firing two employees for “poor performance” after they voiced concerns internally as an ‘appropriate’ reaction.

I don’t see responding to a letter from ~100 of your content producers with a legalese blog post purposely hidden from search engines as ‘appropriate’ either.

People do shitty things; that doesn’t have to reflect on the organization that employs them unless the org decides to do shitty things as well.

thanatropism · on April 18, 2019

I agree with your opinion but will fight to the death for your right to say it!

(Wait, I'm getting this wrong somehow.)

taneq · on April 18, 2019

Without further elaboration on what "uninvited physical contact" means, this smells like manufactured outrage. It's definitely not right to conflate uninvited contact with sexual assault this way, because that trivialises actual sexual assault.

QuackingJimbo · on April 18, 2019

The outrage isn't because the incident happened. It's because DataCamp didn't do anything to reprimand the executive, because they issued a completely tone-deaf response, and because they tried to hide it.

I'm against bogus sexual assault claims as much as anyone but if you're DataCamp you really have no choice but to just fire this guy and move on (and it's obviously too late now to even do that).

piokoch · on April 18, 2019

Couldn't the victim simply call the police or notify the prosecutor (or whatever is the proper procedure to report a crime)? If the DataCamp executive committed a crime, then it will be investigated and the person will be punished according to the rule of law.

Why do the victim and the other people who witnessed what happened count on the executive's partners and colleagues to "take proper action"? In a civilized society, courts should be engaged in such situations; we don't need to rely on ad hoc, possibly biased committees to punish people. If this executive did something wrong, let him rot in prison or whatever is the proper punishment.

cepth · on April 18, 2019

There’s a grey area between what is (should be?) unacceptable in a professional setting, and what constitutes a crime. Even in particularly egregious cases of sexual misconduct, it’s difficult to think of many victims who chose to go to trial. Civil settlements are much more common. Think Roger Ailes at Fox.

The local prosecutor in whatever jurisdiction this incident occurred in is likely to decline to prosecute the case. Prosecuting instances of sexual harassment (assault in this case?) is not typically a prosecutor’s first choice for how to use their office’s limited time and resources.

Maybe she could’ve filed a claim with the EEOC? But, that’s a long drawn out process as well. Most victims of harassment may just choose to try a company’s internal HR processes, since an EEOC claim may require spending inordinate amounts of time, and maybe money if you hire professional help.

From the dozens of Twitter and personal blog testimonials of DataCamp instructors (contractors), it seems that but for collective pressure and organizing from the instructors, DataCamp would’ve just swept this whole incident under the rug.

StreamBright · on April 18, 2019

There is no grey area here. There are things that are classified as crimes and there are things classified as inappropriate behaviour. The "grey area" happens when people on purpose try to mix the two and manufacture outrage.