The memoisation I refer to is called here:
https://github.com/syllog1sm/redshift/blob/segmentation/reds...
What happens is, I extract the set of token indices for S0, N0, S0h, S0h2, etc. into a struct, SlotTokens. SlotTokens is sufficient to extract the features, so I can use its hash to memoise an array of class scores. Cache utilisation is about 30-40% even at k=8.
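Roughly, the memoisation amounts to something like this (a plain-Python sketch; the real code uses a Cython struct and C-level scoring, and the helper names here are made up for illustration):

    # The slot token indices fully determine the features, and therefore the
    # class scores, so the scores can be cached on their hash.
    score_cache = {}

    def cached_scores(slot_tokens, extract_features, score_features):
        """slot_tokens: tuple of token indices (S0, N0, S0h, S0h2, ...)."""
        key = hash(slot_tokens)
        if key not in score_cache:
            feats = extract_features(slot_tokens)   # hypothetical helper
            score_cache[key] = score_features(feats)  # hypothetical helper
        return score_cache[key]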
While I'm here...
https://github.com/syllog1sm/redshift/blob/segmentation/reds...
The big enum names all of the atomic feature values that I extract; their values get written into an array, context. So context[S0w] contains the word of the token on top of the stack.
I then list the actual features as tuples referring to those values. So I can write a group of features with something like new_features = ((S0w, S0p), (S0w,), (S0p,)). That would add three feature templates: one with the word plus the POS tag, one with just the word, and one with just the POS tag.
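The two pieces fit together roughly like this (a toy sketch; the slot names and helpers are hypothetical, not the actual redshift code): the enum assigns each atomic value a fixed index into the context array, and each template is just a tuple of those indices.

    # Hypothetical slot indices into the flat context array.
    S0w, S0p, N0w, N0p, CONTEXT_SIZE = range(5)

    def fill_context(context, words, tags, stack, buf):
        # Write the atomic values for the current parse state into the array.
        s0, n0 = stack[-1], buf[0]
        context[S0w], context[S0p] = words[s0], tags[s0]
        context[N0w], context[N0p] = words[n0], tags[n0]

    new_features = ((S0w, S0p), (S0w,), (S0p,))

    def extract(context, templates):
        # Each template reads its slots back out of the context array.
        return [tuple(context[i] for i in template) for template in templates]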
A bit of machinery in features.pyx then takes those Python feature definitions and compiles them into a form that can be used more efficiently.
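One way that compilation step could look (a hypothetical plain-Python sketch, not the actual features.pyx, which emits C-level structures): flatten the tuple templates into parallel integer arrays so the hot loop walks simple arrays instead of Python objects.

    from array import array

    def compile_templates(templates):
        # lengths[i] says how many context slots template i reads;
        # indices holds all the slot indices back to back.
        lengths = array('i', (len(t) for t in templates))
        indices = array('i', (i for t in templates for i in t))
        return lengths, indices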