How to calculate the alignment between BERT and spaCy tokens

cochne · on July 13, 2021

This is exactly why tokenizers should provide offsets into the original text! The spaCy tokenizer provides them, though the original WordPiece tokenizer shipped with BERT does not. It is relatively easy to add offset information to that tokenizer, though. Once you have start and end character offsets for each token, it's not hard to align tokens by offset overlap.
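A minimal sketch of both steps, assuming the wordpieces are literal substrings of the lowercased text once the "##" prefix is stripped (no accent stripping, no [UNK], no characters whose lowercase form changes length). The function names and the sample tokenization are hypothetical, not taken from BERT or spaCy:

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def wordpiece_offsets(text: str, pieces: List[str]) -> List[Span]:
    """Recover character offsets for wordpieces by scanning the text left to right.

    Assumption: each piece, minus its "##" continuation prefix, appears
    verbatim in the lowercased text, in order.
    """
    lowered = text.lower()
    spans: List[Span] = []
    cursor = 0
    for piece in pieces:
        surface = piece[2:] if piece.startswith("##") else piece
        start = lowered.find(surface.lower(), cursor)
        if start == -1:
            raise ValueError(f"cannot locate {piece!r} in text")
        end = start + len(surface)
        spans.append((start, end))
        cursor = end  # never look backwards, so repeated pieces map in order
    return spans


def align_by_overlap(a: List[Span], b: List[Span]) -> List[List[int]]:
    """For each span in `a`, list the indices of spans in `b` that overlap it."""
    return [
        [j for j, (b_start, b_end) in enumerate(b) if a_start < b_end and b_start < a_end]
        for a_start, a_end in a
    ]


text = "Tokenizers disagree."
# Hypothetical spaCy-style token offsets for "Tokenizers", "disagree", "."
word_spans = [(0, 10), (11, 19), (19, 20)]
# Hypothetical wordpiece split of the same sentence
pieces = ["token", "##izer", "##s", "disagree", "."]

piece_spans = wordpiece_offsets(text, pieces)
alignment = align_by_overlap(word_spans, piece_spans)
print(alignment)  # [[0, 1, 2], [3], [4]]
```

The overlap test is quadratic in the number of tokens, which is fine for single sentences; since both span lists are sorted, a two-pointer sweep would do the same job in linear time.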

PeterisP · on July 15, 2021

Hugging Face tokenizers (e.g. BertTokenizerFast, which can load the BERT model vocabularies) can also provide offsets into the original text.
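A rough sketch of what that looks like, assuming the transformers package is installed and the "bert-base-uncased" vocabulary can be downloaded or is cached. The helper name bert_offsets is made up for illustration:

```python
from typing import List, Tuple


def bert_offsets(
    text: str, model_name: str = "bert-base-uncased"
) -> List[Tuple[str, Tuple[int, int]]]:
    """Return (wordpiece, (start, end)) pairs with character offsets into `text`."""
    # Imported inside the function so this sketch can be loaded even
    # where transformers is not installed.
    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained(model_name)
    # return_offsets_mapping is only supported by the "fast" tokenizers.
    encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
    return [(tok, tuple(span)) for tok, span in zip(tokens, encoding["offset_mapping"])]
```

Calling bert_offsets("Tokenizers disagree.") should give one (start, end) pair per wordpiece, which can then be compared against spaCy's token.idx and token.idx + len(token) for each token.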

tracyhenry · on July 13, 2021

Can someone explain how BERT and spaCy are typically combined in practice for a custom NER task?

tastroder · on July 13, 2021

These days, at least in non-research contexts, people seem to mostly use spacy-transformers for that. It already contains all the necessary glue code between Hugging Face transformers (including BERT) and spaCy.

https://explosion.ai/blog/spacy-transformers / https://spacy.io/universe/project/spacy-transformers

fouc · on July 13, 2021

Does anyone know how those beautiful "terminal" images were generated? Can anyone recommend some resources for making similar images, or for actually configuring iTerm2/Vim/etc. to look like that?

polm23 · on July 13, 2021

https://carbon.now.sh/

occupy_paul_st · on July 13, 2021

Clicked on this thinking it was some kind of crazy decentralized crypto algorithm... yeesh, I gotta spend less time on the blockchain!!