I can't recommend spaCy enough. We use their Prodigy[1] app here at work, and it's outstanding.
After experimenting with gensim, nltk, and most everything else under the sun, we primarily rely on spaCy now with some TensorFlow for specific models.
Glad to hear it! Actually we noticed the order come through for your group and felt it was a shame we didn't have a proper contact email. We're especially pleased to support public institutions like the FRB. I also did a little bit of consulting for the Bank of Canada. Could you email me? matt@explosion.ai
I wasn't too successful running it against tweets (low hit rate/false positives, low spatial resolution) but geolocating tweets is a hard problem and I'm sure it would work better against more structured text.
I recommend Prodigy to label your examples and train a spaCy model. Prodigy is the best tool I have ever used for NLP labeling. Most likely starting with a blank model will work better, but you can try starting with one of spaCy's pre-trained models.
MIT License for spaCy, though you'd never know it from the home page of the site. It is really frustrating to have such a nice tool with such a beautiful site, but for the site to have no mention of the word "License" on it. Yes, it's in the code, but if one goes to the trouble to make such a great site, well...
spaCy's license has evolved over time. I know when its creator was discussing it on HN a few years ago, he had it under AGPL [0]:
> This was my understanding --- actually I designed the licensing structure of this project around the assumption that companies would not want to use GPL licensed code commercially. I offer an AGPL license, and offer a commercial license for a fixed fee.
A couple years ago or so it went MIT, I think he might have mentioned it on HN at the time.
How good is spaCy for exploratory analysis of a corpus?
I'm thinking, questions like, top adjectives applied to men and women, or top word usage across document classes.
The best I've come across for doing this easily is in R, where they have tidytext[1], which is nice and very straightforward to understand and work with. However, the data model stores each token and all its metadata (page number, sentence number, document id / title, document class, word type, etc.) in its own row, causing the in-memory size of whatever corpus you are working on to explode.
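For the "top word usage across document classes" kind of question, here is a minimal sketch in plain Python. It is not spaCy or tidytext; it only shows that keeping one counter per class avoids the one-row-per-token blow-up. The corpus, the stopword list, and the names `DOCS`, `STOPWORDS`, `tokenize`, and `top_words_by_class` are all made up for illustration. The adjectives question would additionally need a part-of-speech tagger (spaCy exposes this as `token.pos_`), which this sketch does not attempt.

```python
from collections import Counter, defaultdict
import re

# Hypothetical toy corpus: (document class, text) pairs.
DOCS = [
    ("sports", "The team won the match and the fans cheered the team"),
    ("sports", "A late goal won the match"),
    ("finance", "The bank raised rates and the market fell"),
    ("finance", "Rates rose again as the bank warned the market"),
]

# Tiny illustrative stopword list, not a real one.
STOPWORDS = {"the", "a", "and", "as"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]


def top_words_by_class(docs, n=3):
    """Return the n most frequent tokens for each document class."""
    counts = defaultdict(Counter)
    for doc_class, text in docs:
        # Only aggregated counts are kept, not one record per token.
        counts[doc_class].update(tokenize(text))
    return {c: counter.most_common(n) for c, counter in counts.items()}


result = top_words_by_class(DOCS)
for doc_class, words in result.items():
    print(doc_class, words)
```

Memory here grows with vocabulary size per class rather than with the total number of tokens, at the cost of losing per-token metadata such as sentence or page number.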
spaCy in production for some grammar-based tokenization (frame.ai)! And their model for span annotations is awesome for our research pipeline; very convenient to be able to traverse pipeline objects in both directions to explore results.
Thank you for the clarification. I don't want to speculate further about the issue, but as an annual Datacamp subscriber I am wondering what effect it could have on the service.
I guess it means a sharp drop in new courses, but what about the existing ones? Could the authors remove them from the platform?
That question is also addressed in the blog post I linked to:
"In general, no. Contract terms vary a bit, but for me: I retain IP and the right to post it elsewhere, but grant DC a perpetual, non-exclusive license to the course material in return for royalties. I've requested breaking the contract by mutual consent, have not heard back."
Thanks again, I saw this at work and didn't read through the updates. I really like the convenience of the Datacamp platform and their "small-bite" approach to learning, and I would be sad to see most of the external courses removed (if they broke the contract), but I also understand and applaud all instructors who want to show their disagreement that way.
I followed the outrage about Datacamp loosely on Twitter. At first I thought it was about an actual assault, but so far, afaik, the only thing publicly known is "uninvited contact" on a dancefloor, and the victim reported that several months later.
Sexual assault or abuse is a crime and should be punished and prosecuted. However, if we treat every situation where two people interact physically and have a different understanding of their relationship as sexual assault, there won't be any sexuality without a written consent form in the future, and real victims get marginalized at the same time. I don't know if that is really what we should be aiming for.
I don't see Datacamp's reaction as inappropriate in this case, at least if there is nothing more to the story behind the curtain. Humans make mistakes, and in most cases everybody should get a second chance.
One moment of misjudgment can be enough today to get fired; your name gets burned forever on the Internet and your life can get turned upside down. And I don't know if that helps the victim in any way.
Probably I will be downvoted for that, but I want to make clear again that I don't tolerate this behaviour in any way and I think the industry needs to be more inclusive and diverse. I just think that you have to differentiate and that public shaming is not in all cases a good solution.
Without further elaboration on what "uninvited physical contact" means, this smells like manufactured outrage. It's definitely not right to conflate uninvited contact with sexual assault this way, because that trivialises actual sexual assault.
The outrage isn't because the incident happened. It's because DataCamp didn't do anything to reprimand the executive, and because they issued a completely tone deaf response, and because they tried to hide it.
I'm against bogus sexual assault claims as much as anyone but if you're DataCamp you really have no choice but to just fire this guy and move on (and it's obviously too late now to even do that).
Couldn't the victim simply call the police or notify the prosecutor (or whatever is the proper procedure to report a crime)? If a DataCamp executive committed a crime, then this will be investigated and the person will be punished according to the rule of law.
Why do the victim and the other people who witnessed what happened count on the executive's partners and colleagues to "take proper action"? In a civilized society, courts should be engaged in such situations; we don't need to rely on some ad hoc, possibly biased committees to punish people. If this executive did something wrong, let him rot in prison or whatever is the proper punishment.
There’s a grey area between what is (should be?) unacceptable in a professional setting, and what constitutes a crime. Even in particularly egregious cases of sexual misconduct, it’s difficult to think of many victims who chose to go to trial. Civil settlements are much more common. Think Roger Ailes at Fox.
The local prosecutor in whatever jurisdiction this incident occurred in is likely to decline to prosecute the case. Prosecuting instances of sexual harassment (assault in this case?) is not typically a prosecutor’s first choice for how to use their office’s limited time and resources.
Maybe she could’ve filed a claim with the EEOC? But, that’s a long drawn out process as well. Most victims of harassment may just choose to try a company’s internal HR processes, since an EEOC claim may require spending inordinate amounts of time, and maybe money if you hire professional help.
From the dozens of Twitter and personal blog testimonials of DataCamp instructors (contractors), it seems that but for collective pressure and organizing from the instructors, DataCamp would’ve just swept this whole incident under the rug.
There is no grey area here. There are things that are classified as crime and there are things classified as inappropriate behaviour. The "grey area" happens when people on purpose try to mix the two and manufacture outrage.
1. https://prodi.gy/