Advanced NLP with spaCy v3 (spacy.io)
207 points by pvpv on Dec 10, 2021 | 38 comments



We've been using spaCy a lot for the past few months.

Mostly for non-production use cases; still, I can say that it is the most robust framework for NLP at the moment.

V3 added support for transformers: that's a killer feature as many models from https://huggingface.co/docs/transformers/index work great out of the box.
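
For instance, loading the transformer-backed English pipeline (en_core_web_trf, which is RoBERTa-based and requires the spacy-transformers package) works just like the regular models:

    import spacy

    nlp = spacy.load("en_core_web_trf")  # transformer-backed pipeline
    doc = nlp("Explosion released spaCy v3 in February 2021.")
    print([(ent.text, ent.label_) for ent in doc.ents])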

At the same time, I found the NER models provided by spaCy to have low accuracy on real data: we deal with news articles https://demo.newscatcherapi.com/

Also, while I see how much attention ML models get from the crowd, I think that many problems can be solved with a rule-based approach, and spaCy is just amazing for these.

Btw, we recently wrote a blog post comparing spaCy to NLTK for text normalization task: https://newscatcherapi.com/blog/spacy-vs-nltk-text-normaliza...


Also I have an article about spaCy NER: https://newscatcherapi.com/blog/named-entity-recognition-wit...

The conclusion I came up with:

"A few notes on my Spacy NER accuracy with "real world" data

Low accuracy with sentences without a proper casing

1. Low accuracy overall, even with a large model

2. You'd need to fine-tune your model if you want to use it in production

3. Overall, there's no open-source high accuracy NER model that you can use out-of-a-box"


> Overall, there's no open-source high-accuracy NER model that you can use out of the box

Part of it is that most people underestimate the complexity of NER; the rest of it, in my opinion, is that NER is not well defined as a classification problem.

At least in my experience, having a specific battery of questions to query documents (first via transformer-based semantic search, then narrowed by Q/A models) removed the need for explicit NER, entity linking, or relation extraction. For the case of entities as features for rule systems, shallow models and using all label predictions instead of just selecting the argmax have been sufficiently robust. Using big transformers for classification doesn't pay off enough to be worth it there.
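
The gist of the question-battery idea, as a rough sketch (the model is just the default extractive QA checkpoint on Hugging Face, and the document and questions are made up):

    from transformers import pipeline

    qa = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

    doc = "Acme Corp. named Jane Doe as CEO on Monday, the company said."
    for q in ["Which company is mentioned?", "Who was appointed?"]:
        ans = qa(question=q, context=doc)
        print(q, "->", ans["answer"], round(ans["score"], 2))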


I assume your product does some kind of entity disambiguation and/or linking to an ontology? spaCy doesn't provide this out of the box either, AFAICT. Can you share more info about how you do it?


We don't provide entity disambiguation out of the box. It's more of an on-request feature for Enterprise clients.

But overall, entity disambiguation is one of the most useful and difficult tasks in NLP.

spaCy supports entity linking via a knowledge base: https://spacy.io/api/entitylinker
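
Roughly, you build a knowledge base of entity IDs, priors, and embeddings, then train the EntityLinker against it. A minimal sketch with the spaCy v3 KB API (the IDs, frequencies, and vectors below are made up for illustration):

    import spacy
    from spacy.kb import KnowledgeBase

    nlp = spacy.load("en_core_web_sm")
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)

    # Each entity gets an ID, a prior frequency, and an embedding.
    kb.add_entity(entity="Q312", freq=800, entity_vector=[0.1] * 64)  # Apple Inc.
    kb.add_entity(entity="Q89", freq=200, entity_vector=[0.2] * 64)   # apple (fruit)

    # Aliases map surface forms to candidate entities with prior probabilities.
    kb.add_alias(alias="Apple", entities=["Q312", "Q89"], probabilities=[0.8, 0.1])
    print([c.entity_ for c in kb.get_alias_candidates("Apple")])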


That might be the killer feature from what I've heard.


NER good enough to anonymise free text would be the absolute dream for many governments.


We use spaCy at work for (mostly) news articles as well. We've been pretty impressed with it overall for detecting larger trends using the NER models. I've been contemplating whether it might be useful to make a spaCy module that uses a Count-Min Sketch to track the top N of each of the NER categories, partitioned into daily (or weekly, etc.) windows.

Think it could be an interesting use case to get sort of similar results to Google's search trends.
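
Something like this, as a rough sketch (all names are illustrative; a real version would bound the candidate sets and rotate sketches per time window):

    import hashlib
    import heapq
    from collections import defaultdict

    class CountMinSketch:
        def __init__(self, width=2048, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _hashes(self, item):
            for i in range(self.depth):
                h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
                yield int(h, 16) % self.width

        def add(self, item, count=1):
            for row, idx in enumerate(self._hashes(item)):
                self.table[row][idx] += count

        def estimate(self, item):
            return min(self.table[row][idx]
                       for row, idx in enumerate(self._hashes(item)))

    class TrendTracker:
        """Approximate top-N entities per NER label for one time window."""

        def __init__(self, top_n=10):
            self.sketches = defaultdict(CountMinSketch)
            self.seen = defaultdict(set)  # candidate strings; cap in production
            self.top_n = top_n

        def observe(self, doc):  # doc is a spaCy Doc
            for ent in doc.ents:
                key = ent.text.lower()
                self.sketches[ent.label_].add(key)
                self.seen[ent.label_].add(key)

        def top(self, label):
            cms = self.sketches[label]
            return heapq.nlargest(self.top_n,
                                  ((cms.estimate(t), t) for t in self.seen[label]))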


I'd really love to chat about that. Any chance to connect? email in bio


I really appreciate how accessible spaCy has made NLP work, but their NER is definitely low accuracy.

Where stem/lem felt critical to successful NLP processing a few years ago, we've found stem/lem work to be much less important for downstream tasks when transformer-based models are involved.

For topic extraction, stem/lem still seems to do a lot to improve accuracy, and for rule-based approaches I can still see how it would facilitate more efficient processing at scale. I'd be curious to hear your experience fine-tuning and/or training new models after stem/lem processing with transformers; we've admittedly done little testing to see how transformers actually perform if properly tuned on stem/lem-processed data.


Did you try something like AutoNLP by Hugging Face?


No, we've got our own fine-tuning pipeline, and initial tests showed better performance without traditional stem/lem processing, so we dropped it from our classification pipelines and haven't seen a need to revisit.


Rule based processing can augment transformers by both filtering out bad input and by parsing good input into a form that plays to the strengths of a model.

You can do some fantastic things with BERT and spaCy, or gpt-neo/J/3, or combinations as needed. Expert systems and ontological tools and things like nltk, spaCy, and LinkGrammar are excellent complements to an AI workflow. Use the fast, "dumb" tools to do the fast, dumb tasks, and only use the huge smart models when you need them.

GPT-3 shouldn't be used if you're just doing tagging or NER, but you can get higher-quality nuanced extrapolation or summarization if you run things through a mad-libs-style prompt generator that leans into prompts that work really well.
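
For example, using spaCy's cheap extraction to fill a fixed template before handing off to the big model. A sketch (the template and pipeline choice are illustrative, not a known-good prompt):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    TEMPLATE = ("Organizations mentioned: {orgs}.\n"
                "People mentioned: {people}.\n"
                "Summarize the article in one sentence:\n{text}\n")

    def build_prompt(text):
        doc = nlp(text)
        orgs = sorted({e.text for e in doc.ents if e.label_ == "ORG"})
        people = sorted({e.text for e in doc.ents if e.label_ == "PERSON"})
        return TEMPLATE.format(orgs=", ".join(orgs) or "none",
                               people=", ".join(people) or "none",
                               text=text)

    # The returned string then goes to GPT-3/gpt-neo etc. for completion.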


Are you using the high-accuracy English model for NER? I've been very happy with ORG recognition; it actually did way better than any other open-source model in my case.


Try it on a sentence where all tokens are lower/upper case. It just doesn’t really work.
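
Easy to check yourself (en_core_web_sm here is just an example; results vary by model and sentence):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    for text in ["Apple is opening an office in Paris.",
                 "apple is opening an office in paris."]:
        doc = nlp(text)
        print(text, "->", [(e.text, e.label_) for e in doc.ents])
    # The cased sentence typically yields ORG/GPE entities;
    # the lowercased one often yields few or none.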


Well, caseless text is a special scenario and not the default one. Case is a very strong signal for NER disambiguation, so if you want to support caseless text, you should apply a special model for it: if the default model included support for caseless text, it would harm accuracy in the majority of scenarios where text actually is cased properly.

In essence, the current approaches are targeted at one domain of text over another. You can have a model that works reasonably in one scenario, or a model that works reasonably in another scenario, or a universal model that works poorly in all scenarios and thus is useless unless you really don't know what you're going to be analyzing.

You can support non-literary slang, but that comes at a cost to accuracy on literary language. You can support multiple variants of a language (e.g. for English: British, Indian, AAVE and non-AAVE American), but that comes at a cost to accuracy on any particular variant. You can support text riddled with typos, grammatical mistakes, and chat abbreviations, but that comes at a cost on correct text. The same applies to word casing. So you try to support these things if and only if you think you need them, since you don't have much of an "accuracy reserve" to sacrifice; the systems generally are barely sufficient for their target domain, and they become insufficient if you try to make them more general than you need to.

It would be nice if the default models explicitly listed their assumptions, though. Like, a model trained only on correct literary text of one language variant in proper case, and on nothing else, should clearly state that.


I don’t know how it compares with other paid alternatives (like Google’s or Amazon’s), but spaCy’s NER was pretty close to the (paid) service we were using (IBM), to the point we ditched IBM. Also for news articles.

But yeah disambiguation/entity linking would be nice.


I'd be happy to chat more if you want.


I feel like NER is a poorly designed task in general. You're eventually trying to link the entities to some kind of KB, so you should be injecting that entity information into your system for detecting mentions.


A relatively underdiscussed quirk of the rise of superlarge language models like GPT-3 for certain NLP tasks is that, since those models have incorporated so much real-world grammar, there's no need to do advanced preprocessing: you can just YOLO and work with generated embeddings instead, without going into spaCy's (excellent) parsing/NER features.

OpenAI recently released an Embeddings API for GPT-3 with good demos and explanations: https://beta.openai.com/docs/guides/embeddings

Hugging Face Transformers makes this easier (and free), as most models return a "last_hidden_state" that you can pool into a document embedding. Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs) and you're probably good to go.
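
A minimal sketch of that, assuming mean pooling over tokens is good enough for your task:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModel.from_pretrained("distilbert-base-uncased")

    def embed(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = model(**inputs)
        # last_hidden_state: (batch, tokens, dim); mean over tokens -> (dim,)
        return out.last_hidden_state.mean(dim=1).squeeze(0)

    print(embed("spaCy and transformers play well together.").shape)  # torch.Size([768])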


While you make sensible points, in the case of GPT-3, not everyone will be willing to route their data through OpenAI's servers.

> Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs)

This can still be impractical, at least in my case of regularly needing to process hundreds of pages of text. Simpler systems can be much faster for an acceptable loss, and you can get more robustness by working with label distributions instead of just picking the argmax.

Fast simpler classifiers can also help decide where the more resource intensive models should focus attention.

Another reason for preprocessing is rule systems. Even if they're not glamorous to talk about, they still see heavy use in practical settings. While dependency parses are hard to make use of, shallow parses (chunking) and part-of-speech data can be usefully fed into rule systems.
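
For instance, with spaCy's Matcher (the pattern is a toy example):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

    doc = nlp("Fast simple classifiers can route expensive models.")
    for _, start, end in matcher(doc):
        print(doc[start:end].text)             # POS-based rule matches
    print([c.text for c in doc.noun_chunks])   # the shallow parses (chunks)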


I imagine it being very useful to understand what you just said


lol. a rough translation is that the new super language models are good enough that you don't have to keep track of specific parts of speech in your programming. if you look at the arrays of floating-point weights that underlie gpt-3 etc., you can use them to match present-participle phrases with other present-participle phrases and so forth

this is of course a correct and prescient observation. minimaxir is kind of an NLP final boss, so I wouldn't expect most people to be able to follow everything he says


I don't think it's a final-boss thing: IMO working with embeddings/word vectors, even in the simplest case such as word2vec/GloVe, is easier to understand than some of the more conventional NLP techniques (e.g. bag of words/TF-IDF).

The spaCy tutorials in the submission also have a section on word vectors.


Ah, although TF-IDF is still good to know. Semantic search hasn't eliminated the need for classical retrieval techniques. It can also be used to select and weight a subset of words when averaging word vectors into a document signature, a quick-and-dirty method for document embeddings.
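
A quick sketch of that weighted-average trick (assumes `vectors` maps lowercase tokens to numpy arrays, e.g. loaded from GloVe; the names are illustrative):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def doc_signatures(docs, vectors, dim=300):
        tfidf = TfidfVectorizer()              # lowercases by default
        weights = tfidf.fit_transform(docs)    # docs x vocab, sparse
        vocab = tfidf.get_feature_names_out()
        out = np.zeros((len(docs), dim))
        for i in range(len(docs)):
            row = weights.getrow(i).tocoo()
            total = 0.0
            for j, w in zip(row.col, row.data):
                vec = vectors.get(vocab[j])
                if vec is not None:
                    out[i] += w * vec
                    total += w
            if total > 0:
                out[i] /= total                # TF-IDF-weighted mean vector
        return out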

Bag-of-words co-occurrences in matrix format are also nice to know: factorizing such matrices was the original vector-space model for distributional semantics, and it provides historical context for GloVe and the like.


> Bag-of-words co-occurrences in matrix format are also nice to know: factorizing such matrices was the original vector-space model for distributional semantics, and it provides historical context for GloVe and the like.

And also, IIRC, still outperforms them on some tasks.


Thank you for making it easier for those in the cheap seats to understand your point!


Readjusting expectations for pre-processing was one of the biggest differences I noticed going from NLP courses to working on NLP in production. For the amount of pre-processing learning material there is, I expected it to be much more important in practice.

I feel lucky to have gotten into NLP when I did (learning in 2017/2018 and working at the beginning of 2020). Changing our system from GloVe to BERT was super exciting and a great way to learn about the drawbacks and benefits of each.


IMHO it's not a difference between courses and production, but rather a difference between the preprocessing needs of different NLP/ML approaches.

For some NLP methods, all the extra preprocessing steps were absolutely crucial (and took most of the time in production); for other methods, they are of limited benefit or even harmful. It's just that in older courses (and many production environments still!) the former methods are used, so the preprocessing needs to be discussed. But if you're using a BERT-like system, then BERT (or something similar) and its subword tokenization effectively become your preprocessing stage.
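
You can see the subword tokenizer absorbing the old stem/lem job directly (the exact pieces depend on the checkpoint's vocabulary):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    print(tok.tokenize("Lemmatization feels unnecessary here"))
    # e.g. ['lem', '##mat', '##ization', 'feels', 'unnecessary', 'here']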


I really love spaCy; it's trivial to throw up a server that handles basic NLP. No complaints here, and I'm very happy to see it still being updated.


[flagged]


"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html


As usual, dang is wrong and not moderating effectively. This is not a shallow comment but a legitimate concern about spaCy, and to a lesser extent other NLP tools such as NLTK. Most of the tooling around them that people end up using really is nothing more than wrappers around other tools. See the default tokenizers or models utilized by these tools.

And yes, even if spaCy itself is not making money, you can bet that the other paid-for tools they sell are.


Actually if the GP had posted this critique instead of a shallow, reductionist internet dismissal ("just want to sell the hype"), that would have been fine. Thoughtful critique is welcome—it just requires higher-quality comments than that.


spaCy sells services; it's not free. From https://explosion.ai/about:

"In August 2021, we sold 5% of Explosion to SignalFire for $6 million. Employees are given a stake in Explosion using a virtual share bonus program."


NLTK might not be the best example of this; it was originally written for pedagogical purposes[1]. One that was sorely needed, too: in 2006 it was very difficult for a student in an intro NLP class to easily track down and use existing implementations of various algorithms. NLTK continues to be useful for similar reasons today, as it provides the only relatively usable interface to some valuable but poorly engineered (by modern standards) academic resources such as FrameNet.

Anyway, the original commenter seems to presuppose that there's no value in collating and polishing existing resources. Isn't that all, say, a Linux distro is? Is it true that "[Canonical] just wrap all the open source [code] into a [distro] and just want to sell the hype"? If curating and polishing seem to have value on the market, maybe we ought to at least entertain the possibility that this value is not an illusion of marketing. That's why I agree with dang that the OC is a facile dismissal.

Also worth noting that spaCy isn't just wrappers. spaCy's tokenizer, which is an original work, is used in at least two cutting-edge academic NLP libraries, AllenNLP (@AI2)[2] and Stanza (@Stanford)[3], presumably because using spaCy's tokenizer was better than the alternatives.

[1]: https://aclanthology.org/P06-4018.pdf

[2]: https://github.com/allenai/allennlp/blob/e0ee7f43d5da973e77d...

[3]: https://github.com/stanfordnlp/stanza/blob/68aa42653d656f613...




That's a huge part of software development: wrapping things to make them more concise and use-case driven. I mean, most software developers are just placing a veneer over something more complex. That's pretty much all we do.


Ah, yes. The tried-and-true method of "just selling the hype" with an open source library that everyone can use for free.



