Show HN: Languagecrunch – NLP server Docker image (docker.com)
152 points by artpar on Jan 14, 2018 | 24 comments



Nice!

Relevant links for anyone interested:

* spaCy on GitHub: https://github.com/explosion/spacy

* NER demo: https://demos.explosion.ai/displacy-ent/

* Neural coref by HuggingFace: https://huggingface.co/coref/

* Accuracy of built-in spaCy models: https://spacy.io/usage/facts-figures

Last time I calculated, the lowest-cost way to run spaCy in the cloud was on Google Compute Engine n1-standard preemptible instances. It should be over 100x cheaper per document than using Google, Amazon or Microsoft's cloud NLP APIs. Accuracy will depend on your problem, but if you have your own training data, performance should be similar.


I've run spaCy on small $10/month Linux VPS instances. What do you mean by cheapest? I'm sure you're right, but what volume are you referring to?


I'm referring to the best price per word when the service is continually active. Like, if you want to parse a web dump, what type of instance do you provision a bunch of?
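
(For that kind of bulk job, most of the throughput comes from batching documents through spaCy's nlp.pipe rather than calling the pipeline one document at a time. A minimal sketch, assuming the small English model and an arbitrary batch size, neither of which is from this thread:)

    import spacy

    # Stream documents through the pipeline in batches; model name and
    # batch size here are illustrative assumptions.
    nlp = spacy.load("en_core_web_sm")

    texts = ["First document ...", "Second document ...", "Third document ..."]

    for doc in nlp.pipe(texts, batch_size=1000):
        # Do something cheap per document, e.g. collect named entities.
        print([(ent.text, ent.label_) for ent in doc.ents])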


This coref thing [1] is great, the visualization is handy. I'd love to see it combined with NER [2].

[1] https://goo.gl/JnS95x

[2] https://goo.gl/k5YRjN


GCE's preemptible instances are so much easier to use and manage compared to AWS's spot instances. I made a rule to build only stateless services just so I could leverage these.


What do you need on the docker host machine to run this? Any specific docker version? GPU?

Also, it would be useful to see the Dockerfile or script that generated this image.


No special requirements. I run it on an Ubuntu 16 server in production and locally on macOS.

Adding a GPU would probably improve performance for both spaCy and neuralcoref.

> Also, it would be useful to see the Dockerfile or script that generated this image.

Will put it on GitHub shortly.


Seems interesting! How can it be used with languages other than English?


spaCy only works with English, German, Spanish, Portuguese, French, Italian and Dutch.

FastText for example has pretrained embeddings for 294 languages: https://github.com/facebookresearch/fastText/blob/master/pre...

Google's Parsey McParseface handles POS tagging for 53 languages: https://github.com/tensorflow/models/blob/f87a58cd96d45de73c...


So spaCy has support for these languages [1] and WordNet has support for these [2], but neuralcoref (the pronoun resolution endpoint) is available only for English.

The current Docker image doesn't expose those other languages, but I can expose them in an update if that would help a lot of people.

[1] https://spacy.io/usage/models [2] http://compling.hss.ntu.edu.sg/omw/
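
(For anyone curious, using one of the non-English models from [1] is just a download-and-load; a minimal sketch, assuming the German model has already been fetched with "python -m spacy download de_core_news_sm":)

    import spacy

    # Load a non-English spaCy model; the German model is just an example.
    nlp = spacy.load("de_core_news_sm")

    doc = nlp("Berlin ist die Hauptstadt von Deutschland.")
    for ent in doc.ents:
        print(ent.text, ent.label_)  # e.g. Berlin LOC, Deutschland LOC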


Thanks for the insights. Could you please share the Dockerfile so one can get the other languages working?



SpaCy models for different languages and how to use them: https://spacy.io/usage/models


The demo examples are wrong or don't make much sense.

"Donald Trump's administration" is not a person.

In the following example, "The currency" is not a subject and "India" is not an object.

I don't know how much useful information is extracted by this system.


That example is a tweet, which the syntax and NER models haven't been trained on. You can make calls to `nlp.update()` to improve it on your own data. We also have an annotation tool, https://prodi.gy , to more quickly create training data.

(I'm the author of spaCy, not this Docker container.)
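
(For reference, a minimal sketch of that kind of update loop with the spaCy 2.x API; the model name and the single training example below are illustrative assumptions, not anything from this thread.)

    import random
    import spacy

    nlp = spacy.load("en_core_web_sm")

    # One made-up NER example: character offsets plus a label.
    TRAIN_DATA = [
        ("Donald Trump's administration announced new tariffs.",
         {"entities": [(0, 12, "PERSON")]}),  # "Donald Trump"
    ]

    # begin_training() for brevity; when fine-tuning an existing pipeline you
    # may prefer resume_training() / create_optimizer(), depending on version.
    optimizer = nlp.begin_training()
    for _ in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)
        print(losses)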


spaCy is wonderful; I've used it a lot over the years and I have high confidence in its output.

I just wish the author of this docker container chose demo sentences that advertised it better.


> "The currency" is not a subject and "India" is not an object.

But "subject" and "object" is for indicating the Subject-Verb-Predicate (object) of the sentence and not as in literal object ?


"India" is neither the predicate nor the object of the sentence.


You are correct. That is clearly a wrong example. Will change that.

Also, that issue is in my code (a poor naming choice). Will put the code up on GitHub soon. Hope that helps.



@artpar, can you post some docs on the endpoints and how they should be used? I want to tie this into a speech-to-text system but I need more API info.


Added docs at the bottom of the readme.

https://github.com/artpar/languagecrunch


Cool! What corpus was this trained on?


Using the "en_core_web_lg" for spacy [1] and neuralcoref along with the pre-trained models on github [2].

[1] https://spacy.io/models/en [2] https://github.com/huggingface/neuralcoref/tree/bee05b1b55e3...



