How to calculate the alignment between BERT and spaCy tokens (gist.github.com)
80 points by polm23 on July 12, 2021 | 7 comments



This is exactly why tokenizers should provide offsets into the original text! The spaCy tokenizer provides this, though the original WordPiece tokenizer shipped with BERT does not. It is relatively easy to add offset information to that tokenizer, though. Once you have start and end character offsets for each token, it's not hard to align the two tokenizations by offset overlap.
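A minimal sketch of the offset-overlap idea, assuming both tokenizers already give you half-open `(start, end)` character spans into the same original string. The function name `align_by_overlap` and the example spans are made up for illustration; they are not from the gist or from either library.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def align_by_overlap(a_spans: List[Span], b_spans: List[Span]) -> List[List[int]]:
    """For each span in a_spans, list the indices of b_spans that overlap it."""
    alignment: List[List[int]] = []
    for a_start, a_end in a_spans:
        overlapping = [
            j
            for j, (b_start, b_end) in enumerate(b_spans)
            # Two half-open intervals overlap when each starts before the other ends.
            if a_start < b_end and b_start < a_end
        ]
        alignment.append(overlapping)
    return alignment


text = "Tokenizers matter"
# Hypothetical word-level tokens: "Tokenizers", "matter"
word_spans = [(0, 10), (11, 17)]
# Hypothetical subword tokens: "Token", "##izers", "matter"
subword_spans = [(0, 5), (5, 10), (11, 17)]

word_to_subword = align_by_overlap(word_spans, subword_spans)
print(word_to_subword)  # [[0, 1], [2]]
```

This is the quadratic version, which is fine for sentence-sized inputs; since both span lists are sorted, a two-pointer sweep would do the same in linear time. Note that an empty span such as `(0, 0)` matches nothing under this overlap test.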


Hugging Face tokenizers (e.g. BertTokenizerFast, which can load the BERT model vocabularies) can also provide offsets into the original text.


Can someone explain how BERT and spaCy are typically combined in practice for a custom NER task?


These days, at least in non-research contexts, people seem to mostly use spacy-transformers for that. It already contains all the necessary glue code between Hugging Face transformers (including BERT) and spaCy.

https://explosion.ai/blog/spacy-transformers / https://spacy.io/universe/project/spacy-transformers


Does anyone know how those beautiful "terminal" images were generated? Or recommend some resources for how to make similar images? Or how to actually configure iterm2/vim/etc to that point?



Clicked on this thinking it was some kind of crazy decentralized crypto algorithm... yeesh, I gotta spend less time on the blockchain!!



