I dunno, you can download a million or so PDFs from arxiv.org and even more from archive.org. They aren't hard to find.
There is something to say for roundtripping PDFs from source you control (you can accurately model the corruption produced by a particular system) but you will certainly see new and different phenomena if you try more.
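For anyone who wants to try that, a minimal round-trip is only a few lines. This is a rough sketch, not anything from the article: the file name and test string are placeholders, and reportlab/pypdf are just one possible render/extract pairing.

    # Round-trip a known string through a PDF and inspect what the extractor gives back.
    # Sketch only: reportlab to render, pypdf to extract; any renderer/extractor pair works.
    from reportlab.pdfgen import canvas
    from pypdf import PdfReader

    original = "The job finished at noon."  # text we control, so we know the ground truth

    c = canvas.Canvas("roundtrip.pdf")
    c.drawString(72, 720, original)
    c.save()

    extracted = PdfReader("roundtrip.pdf").pages[0].extract_text()
    print(repr(original))
    print(repr(extracted))  # diff the two to see exactly what this toolchain corrupts

Because you wrote the input yourself, any difference in the output is corruption introduced by that particular toolchain, which is the "accurately model the corruption" point above.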
I'd agree that spaCy's sentence segmentation is better than many of the alternatives.
If "new and different phenomena" means new kinds of corruption and downright weird behavior, I'll end up with no hair left!
Even printing the same page to PDF with Chrome and Firefox delivers quite different results. Firefox was often combining "f" and "i" into the "ﬁ" ligature [0], which totally changed the extracted text of "finished", for example.
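If it helps anyone hitting the same thing: Unicode NFKC normalisation folds these compatibility ligatures back into their component letters, so a pass like this (a sketch, not the code from the post) cleans them up before tokenisation:

    import unicodedata

    text = "The job was \ufb01nished."  # that's the single U+FB01 "fi" ligature glyph
    clean = unicodedata.normalize("NFKC", text)  # expands it to "f" + "i" (also handles fl, ff, ...)
    print(clean)  # -> "The job was finished."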
Downloading a lot of random PDFs from arxiv would be great for making something battle-hardened and robust (and I'd love to get the chance to do it sometime) but I didn't have the time (or the remaining hair) to do it this time round.
And +1 to spaCy. I typically use it over Transformers because it's SO much faster. I just used Transformers in this example for a change. My Stack Overflow search notebook [0] uses spaCy.
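For reference, sentence splitting with spaCy is only a few lines. A sketch assuming the small English model is installed, not the notebook's exact code:

    # pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small pipeline; sentence boundaries come from the parser
    doc = nlp("PDF extraction is messy. Ligatures like \ufb01 make it worse. spaCy copes.")
    for sent in doc.sents:
        print(sent.text)

The speed difference comes from spaCy's smaller pipelines doing segmentation without a large transformer forward pass, which is why it tends to win when you only need sentence boundaries.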