Wow, I just used the AllenNLP demo mentioned here and it is quite amazing! I took a random article from Google News, which happened to be about Flynn's FBI criticism. I asked a couple of questions like "who is going to jail" or "who is leading the investigation" and it worked flawlessly. The article is only around 15 sentences too!
A few years ago I read the novel "Galatea 2.2" by Richard Powers which is all about training a neural network to do just that and thought "now this is some bullshit".
I love textacy; it has so much out of the box: topic modeling, topic extraction, summarization, and it's built on top of spaCy. https://github.com/chartbeat-labs/textacy
If anyone from the dev team is here, can you look into integrating the "make the research into production" part into AllenNLP? Facebook currently has fairseq, this, and other NLP repos. AllenNLP makes it easier to model most classes of NLP problems with a clean dependency-injection interface, with most common tasks abstracted out cleanly.
AllenNLP dev here. We're going to do a "PyTorch 1.0" release of AllenNLP next week, and then after that we're planning to investigate how to incorporate the new "production" aspects.
Could you guys elaborate on the relationship between PyText, torchtext, and AllenNLP? I've briefly used the latter two, but with how quickly things are moving it'd be nice to have a quick answer from the devs themselves.
PyText dev here. Torchtext provides a set of data abstractions that help with reading and processing raw text data into PyTorch tensors; at the moment we use Torchtext in PyText for training-time data reading and preprocessing.
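To make that concrete, the idea behind those data abstractions can be sketched in a few lines of plain Python. The names below are illustrative only, not the Torchtext API: tokenize raw text, build a vocabulary, and numericalize each example into a padded list of integer ids ready to become tensors.

```python
# Library-free sketch of a text-to-tensor pipeline (illustrative names,
# not the Torchtext API): tokenize, build a vocab, numericalize + pad.

PAD, UNK = "<pad>", "<unk>"

def build_vocab(texts):
    """Map every token seen in the corpus to an integer id."""
    vocab = {PAD: 0, UNK: 1}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab, max_len):
    """Convert one raw string into a padded list of token ids."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in text.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

corpus = ["the model reads text", "text becomes tensors"]
vocab = build_vocab(corpus)
batch = [numericalize(t, vocab, max_len=5) for t in corpus]
```

The real library adds batching, bucketing by length, and pretrained-vector loading on top of this core idea.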
AllenNLP is a great NLP modeling library aimed at providing reference implementations and prebuilt state-of-the-art models, making it easy to iterate on and research models for different NLP tasks.
We've built PyText to be a rich NLP modeling library (along the lines of AllenNLP) but with production capabilities baked into the design from day one.
Examples are:
- We provide interfaces to make sure data preprocessing can be consistent between training and runtime
- The model interfaces are compatible with ONNX and torch.jit
- A core goal for us in the next few months is to be able to run models trained in PyText on mobile.
There are other differences as well, like support for distributed training and multi-task learning.
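To illustrate the first bullet above, here's a minimal sketch, assuming nothing about PyText's actual API, of the underlying idea: keep one preprocessing function as the single source of truth so training-time and runtime featurization can never drift apart.

```python
# Illustrative sketch (not PyText's API): a single shared preprocessing
# function guarantees training and inference produce identical features.
import re

VOCAB = {"<unk>": 0, "turn": 1, "on": 2, "the": 3, "lights": 4}

def preprocess(text, vocab):
    """Shared featurization: lowercase, strip punctuation, map to ids."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

def train_example(raw_text):
    # Training-time path uses the shared function...
    return preprocess(raw_text, VOCAB)

def serve_request(raw_text):
    # ...and so does the runtime path: consistency by construction.
    return preprocess(raw_text, VOCab := VOCAB)
```

With a design like this, exporting the model (e.g. via ONNX or torch.jit) only has to worry about the tensor-in/tensor-out part, because the text-to-tensor step is identical in both paths.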
That being said, so far our library of models has mostly been shaped by our current production use cases; we are actively working on enriching it with more models and tasks while keeping production capabilities and inference speed in mind.
AllenNLP is great, and influenced the design of PyText in several ways. There are some central design decisions of AllenNLP that make it incompatible with PyTorch's jit tracing and so make productionizing models require much more manual work. It also generally leaves preprocessing up to the user, so preprocessing consistently between training and inference is outside the scope of what AllenNLP does.
What is a good data structure for holding your parsed corpus? Ideally I'd like to be able to count the number of sentences and paragraphs, get average word counts for these, and easily run queries such as "nouns that fit this regex" or "POS that precedes a named entity".
I've been looking at spaCy, but as far as I can tell it is hard-coded to use universal parts of speech.
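One structure that makes those queries easy, independent of any particular library, is a plain list of per-sentence token records: corpus-level counts and pattern queries then become simple comprehensions. The toy tags below are made up for illustration; a real tagger (spaCy or otherwise) would fill them in.

```python
# Sketch of a parsed corpus as nested lists of token records.
# Tags and example tokens are invented for illustration only.
import re
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    pos: str         # coarse part-of-speech tag
    is_entity: bool  # part of a named entity?

corpus = [  # one inner list per sentence
    [Token("Flynn", "PROPN", True), Token("criticized", "VERB", False),
     Token("the", "DET", False), Token("investigation", "NOUN", False)],
    [Token("Agents", "NOUN", False), Token("questioned", "VERB", False),
     Token("Smith", "PROPN", True)],
]

n_sentences = len(corpus)
avg_words = sum(len(s) for s in corpus) / n_sentences

# "nouns that fit this regex" (here: nouns with 8+ characters)
long_nouns = [t.text for s in corpus for t in s
              if t.pos == "NOUN" and re.match(r".{8,}", t.text)]

# "POS that precedes a named entity"
pos_before_entity = [s[i - 1].pos for s in corpus for i in range(1, len(s))
                     if s[i].is_entity]
```

Paragraph structure can be added as one more level of nesting, and swapping in a different tagset is just a matter of what you store in `pos`.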
Edit: Super wow, the documentation is amazing as well (https://allennlp.org/tutorials).