Show HN: An NLP Library for Matching Parse Trees (github.com/ayoungprogrammer)
46 points by youngprogrammer on July 9, 2016 | 12 comments



If you clicked through to the two links from laretluval and brudgers, you can see why natural language processing as a field is struggling to gain adoption quickly (in proportion to how well understood its concepts are). Look at laretluval's links: the people doing the hardcore research are doing a really poor job of explaining what exactly they are trying to accomplish. Can a programmer who is good at programming but not familiar with computer science concepts actually figure out what tregex does, even after reading the page a few times? Do you seriously expect someone to download a PPT file (yes, a ppt, not a pdf) to understand the basics?

Contrast that with brudgers' link - it is actually a readable summary even though I personally think the person who posted that blog entry still needs to learn more concepts in NLP/English grammar/hierarchical data structures to scale the project - all his examples are active voice - using regex will fail as the sentence becomes more run-on, like the one you are currently reading - hand-crafting rules for English grammar is actually super hard because even trained linguists sometimes disagree on the parse tree produced by fairly short sentences (I think I learnt that from watching a YouTube video by Chris Manning; unfortunately I don't have the reference right now).

I don't understand how the NLP community seems so oblivious to this issue.


Alternative explanation: NLP is lagging behind what we expect because it is a very challenging domain. Explanations of NLP concepts are nontrivial because fairly complex computational tools are required to get anywhere: smoothed n-gram models, PCFGs, LDA topic models, etc. Like computer vision, it requires a combination of statistics, computer science (runtimes and data structures), and an understanding of the target domain. To understand the basics may require taking an NLP class, reading Jurafsky and Manning, and looking at quite a few lectures (which, yes, are occasionally distributed as PDFs).
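
To make "fairly complex computational tools" a bit more concrete, here is a minimal, illustrative sketch of the simplest item on that list: an add-one (Laplace) smoothed bigram model. The toy corpus and function name are made up for the example.

    # Minimal, illustrative add-one (Laplace) smoothed bigram model.
    # The toy corpus below is made up purely for demonstration.
    from collections import Counter

    corpus = ["the cat sat", "the cat ran", "the dog sat"]
    sentences = [s.split() for s in corpus]

    unigram_counts = Counter(w for sent in sentences for w in sent)
    bigram_counts = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigram_counts)

    def smoothed_bigram_prob(prev, word):
        """P(word | prev) with add-one smoothing so unseen bigrams get nonzero mass."""
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

    print(smoothed_bigram_prob("the", "cat"))  # seen bigram   -> 0.375
    print(smoothed_bigram_prob("the", "ran"))  # unseen bigram -> 0.125

And that is the easy one; the smoothing schemes, parsers, and topic models used in practice are considerably more involved.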


Fair enough. My concern is whether the practitioners of NLP are aware of this perception - and it seems like they are.


I agree with your criticism in general, but the lack of outsider-friendly explanation here seems justified because something like a parse tree matcher is more of a tool that's useful inside the NLP research community than for end users. When does an end user ever need to find trees with a particular syntactic structure? On the other hand, it is very useful for debugging parsers, verifying annotation standards in corpora, etc.: things that NLP researchers have to do.

There is some NLP software that does a great job of explaining what it does, how it does it, and why this is useful. http://spacy.io/ comes to mind. Maybe that's the happy exception.


I would argue that parse tree matching is useful to end users (programmers) who are trying to extract some meaning from a sentence. By matching a parse tree, you can match and extract the contextual information about a sentence, e.g. the subject, action, and object. Compare this to the intent-matching machine learning approach, where you feed a model some sample sentences and manually tag them (e.g. wit.ai). You feed a sentence to this black box and you might get the right intent and context matching, but you don't know what's going on and you have no control over the matching. Manually creating rules for matching parse trees is a little more work than manually tagging sentences, but it allows for more control and transparency over how the matching is done.
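
For a concrete (if simplified) picture of the rule-based side, here is a sketch of pulling a subject/action/object triple out of a sentence. It uses spaCy's dependency parse (mentioned elsewhere in this thread) rather than the submitted library or constituency-tree templates, so treat it as an illustration of the idea, not of this library's API; the sentence and function name are invented for the example.

    # Illustrative only: extract a (subject, verb, object) triple from a
    # dependency parse with spaCy. Not the submitted library's API.
    # Assumes the "en_core_web_sm" model has already been downloaded.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_svo(sentence):
        """Return (subject, verb, object) if that pattern is present, else None."""
        doc = nlp(sentence)
        for token in doc:
            if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
                verb = token.head
                objects = [c for c in verb.children if c.dep_ in ("dobj", "obj")]
                if objects:
                    return (token.text, verb.text, objects[0].text)
        return None

    print(extract_svo("The user opened the file"))  # ('user', 'opened', 'file')

The point is that every step is inspectable: if a sentence fails to match, you can look at its parse and see why, which is exactly the transparency the black-box intent matchers lack.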


Spacy's website looks good. Thanks for the heads up.


Unfortunately the two demos 404 at the moment. I hope we can have everything back online soon.


Hey thesoonerdev,

I am the person who posted the blog entry. It's true that I don't know much about how NLP parsing works, but I do know how the parse trees are structured. I believe matching parse trees is scalable. The examples in my post were for short imperative commands, but it is relatively simple to create rules for more complex sentences. It might not be perfect in every case, but I would say it works well for the majority of cases.

I'm glad you were able to understand the blog post, and I agree that the material on tregex is not clear and would be difficult to pick up. I hope the library I wrote will let programmers start using the Stanford parsing libraries more easily.


Similar software for matching parse trees:

Tregex: http://nlp.stanford.edu/software/tregex.shtml

tgrep2: http://tedlab.mit.edu/~dr/Tgrep2/
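
For anyone who has never seen these tools, a Tregex pattern like "VP < NP" means "a VP node that immediately dominates an NP". A rough Python/NLTK approximation of that one pattern, purely as an illustration (the parse string is hand-written, and the real Tregex is a Java tool with a much richer pattern language):

    # Rough approximation of the Tregex pattern "VP < NP" using NLTK.
    from nltk import Tree

    parse = Tree.fromstring(
        "(S (NP (DT The) (NN cat)) (VP (VBD ate) (NP (DT the) (NN mouse))))"
    )

    def vp_dominating_np(subtree):
        """True for VP nodes with an NP as a direct child."""
        return subtree.label() == "VP" and any(
            isinstance(child, Tree) and child.label() == "NP" for child in subtree
        )

    for match in parse.subtrees(filter=vp_dominating_np):
        print(match)  # (VP (VBD ate) (NP (DT the) (NN mouse)))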



This would be really cool to apply to programming languages. That is, matching abstract syntax trees against each other.

This way you could identify similar chunks of code. I had an idea related to this for identifying security vulnerabilities: https://news.ycombinator.com/item?id=11573547
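
In case it helps make the idea concrete, here is a tiny sketch of that in Python using the standard ast module: two snippets count as structurally similar if their ASTs match once identifier names are normalized away. It only illustrates the concept; it is nowhere near a real similarity detector or vulnerability scanner.

    # Two snippets are "structurally similar" here if their ASTs are identical
    # after every variable name is replaced by a placeholder. Concept sketch only.
    import ast

    class NormalizeNames(ast.NodeTransformer):
        """Replace every variable name with "_" so only the structure remains."""
        def visit_Name(self, node):
            return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def structural_fingerprint(source):
        tree = NormalizeNames().visit(ast.parse(source))
        return ast.dump(tree)

    a = "total = price * quantity"
    b = "result = x * y"
    print(structural_fingerprint(a) == structural_fingerprint(b))  # True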


Nice. I started on an impl in Java several years ago but never got far (https://github.com/bpodgursky/nlpstore/blob/master/src/test/...).



