State of the art research indicates that joint processing (segmentation, tagging...

syllogism · on April 28, 2014

The Cython parser is basically an implementation of zpar, and achieves slightly faster run-times, and slightly better accuracy (I've added some extra features, and made some nips and tucks).

Note that the Stanford label set has 40 labels, so there are about 80 classes to evaluate. The Penn2Malt scheme has 20 labels, so you need to be careful which dependency scheme is being referenced when run-time figures are reported.

The way the run-time cost works is, if you extract f features per transition, for c classes, with a beam of size k and n words, you make O(cfkn) feature-lookups, which is the main cost.

For the parser.py implementation, most of the speed is coming from greedy search (k=1), and low number of classes (c=3, instead of c=80). Number of feature templates, f, is similar between this parser and zpar. We could add some more templates for label features, and do labelled parsing, and gain about 1% accuracy here, at the cost of being about 40x slower. The only reason I didn't was that it complicates the implementation and presentation slightly. The implementation was all about the blog post.

The Cython parser does everything with C data structures, which are manually memory managed. I don't think I'm paying any language overhead compared to a C++ implementation. So you're absolutely right that as more feature templates stack on, and you use more dependency labels, speed goes down. But, the Cython parser has no problem relative to zpar in this respect.

habeanf · on April 28, 2014

You can do better than O(cfkn). You can precompute the results of some features seen together (i.e. S0w/S0t/S0wt), and store a ref to them on the words. This saves feature lookups per feature. If you store them as vectors for the classes, you can get a nice perf. boost from SIMD.. you can knock off at least an order of magnitude if not two. something like O(c'f'kn), where c' is the # of SIMD ops to aggregate pre computed weight vectors and f' is the number of feature template groups. Also Yoav Goldberg mentioned feature signatures in one of his papers which can do a bit better.

A significant problem with unlabeled dep. parsing is that you can't differentiate important things like subject vs object dependents. In the sentence "They ate the pizza with anchovies.", how would a program distinguish between 'they' as the subject and 'pizza' as the object? In other words, who ate what?

syllogism · on April 28, 2014

Yes, you're probably aware of this paper, right?

Goldberg et al, "Efficient Implementation of Beam-Search Incremental Parsers". ACL 2013. http://www.aclweb.org/anthology/P13-2111

I haven't been able to work out how to do the feature caching in a way that won't ruin my implementation when I need to add more features.

I also get substantial benefit at high k from hashing the "kernel tokens" and memoising the score for the state.

I did try the tree-structured stack that they recommend, but I didn't find any run-time benefits from it, and the implementation kept confusing me. I might have made a mistake, but I suspect it's because my state arrays are copied with low-level malloc/free/memcpy, where they pay Python overhead on their copies.

habeanf · on April 28, 2014

Yes.

I didn't see noticeable improvements from TSS either. I did some performance tuning - much more time goes to feature extraction and scoring. Can you elaborate on what you mean by 'hashing the "kernel tokens" and memoising the score for the state'? Are the kernel tokens something like the head of stack/queue?

For feature caching, I went with a generic model for a feature template as a combination of feature elements (for features like S0t+Q0t+Q1t) that have a closed set, so the feature template is limited to a set that is a cartesian product of the elements' sets. When you initialise parsing for a new sentence, you can select a subset of the possibilities to generate a "submodel" for only that sentence. That way you need much less memory. If you can pack it properly you can get a lot of it into the lower level caches which should allow for significant speed up.

syllogism · on April 28, 2014

Thanks for the explanation of how your cache works. Will you be at ACL this year?

The memoisation I refer to is called here:

https://github.com/syllog1sm/redshift/blob/segmentation/reds...

What happens is, I extract the set of token indices for S0, N0, S0h, S0h2, etc, into a struct SlotTokens. SlotTokens is sufficient to extract the features, so I can use its hash to memoise an array of class scores. Cache utilisation is about 30-40% even at k=8.

While I'm here...

https://github.com/syllog1sm/redshift/blob/segmentation/reds...

The big enum names all of the atomic feature values that I extract, and places their values into an array, context. So context[S0w] contains the word of the token on top of the stack.

I then list the actual features as tuples, referring to those values. So I can write a group of features with something like new_features = ((S0w, S0p), (S0w,), (S0p,)). That would add three feature templates: one with the word plus the POS tag, one with just the word, one with just the POS tag.

A bit of machinery in features.pyx then takes those Python feature definitions, and compiles them into a form that can be used more efficiently.

habeanf · on April 28, 2014

I won't be at ACL this year, maybe next. My advisor will be there though (Reut Tsarfaty).

habeanf · on April 28, 2014

First things first: hats off on a concise, accurate and efficient implementation in python.

But the Cython parser is not doing labeled parsing.. a significant number of feature templates are relevant to labeled parsing.

To reach Nivre and Zhang's 2011 results I think you'll find labeled parsing will degrade performance significantly and you'll need more complex code

1. you have more than an order of magnitude more transitions (80 vs 4)

2. you need to add beam search, with early update for training. beam size = 64 to be consistent with zpar

3. an order of magnitude more feature templates, with more complex features

So now you'll be evaluating 20x more transitions, on ~64x more candidates, with ~10x more feature templates. You can't multithread properly in python, you can maybe get some more juice by using some form of SIMD (w/ or w/o GPU). Your model is much larger in memory, so any luck you had in L2/L3 caching is probably gone. Python has limits and this is the point you start hitting them.

Edit: I see you've updated your comment, I guess we're on the same page. It's nice to see a concise implementation. I wrote one in golang for my masters thesis. I achieved perfect parity with zpar, but it took two order of magnitude more LOC.

syllogism · on April 28, 2014

Sorry for the lack of clarity. The Cython parser results that I quoted uses labelled parsing with beam search. There are various versions in the repo, but there's definitely a configuration that runs the Zhang and Nivre (2011) experiment verbatim, with exactly the same feature templates, and the same results that they get. Replicating those results was essential to implementing the parser correctly.

habeanf · on April 28, 2014

Apologies, could you add yet more clarity? According to the blog post your results are:

Parser Accuracy Speed (w/s) Language LOC

Stanford 89.6% 19 Java > 50,000[1]

parser.py 89.8% 2,020 Python ~500

Redshift 93.6% 2,580 Cython ~4,000

Are these the labeled parsing results you are referring to? How many sents/sec? Using same PTB data sets as Zhang and Nivre '11?

syllogism · on April 28, 2014

Those are Unlabelled Accuracy score results, but Redshift is quietly computing the dependency labels, while parser.py does not. Running Redshift in unlabelled mode gives very fast parse times, but about 1% less accuracy. The labels are really good, both as features and as a way to divide the problem space.

The data sets are the _Stanford_ labels, where the main results in Zhang and Nivre refer to MALT labels. Z&N do provide a single Stanford accuracy in their results, of 93.5% UAS.

Sentences per second should be just over 100. I use k=8 with some extra features referring to words further down the stack, where Z&N use k=64. Right at the bottom of the post, you can find the commit SHA and the commands for the experiment.

zehavoc · on April 29, 2014

Hi, how does this parser compare to clearnlp? it's supposed to be also super fast and very accurate to?