I dunno, you can download a million or so PDFs from arxiv.org and even more from archive.org. They aren't hard to find.
There is something to say for roundtripping PDFs from source you control (you can accurately model the corruption produced by a particular system) but you will certainly see new and different phenomena if you try more.
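For anyone who wants to try that, a minimal round-trip is only a few lines. This is a rough sketch, not anything from the article: the file name and test string are placeholders, and reportlab/pypdf are just one possible render/extract pairing.

    # Round-trip a known string through a PDF and inspect what the extractor gives back.
    # Sketch only: reportlab to render, pypdf to extract; any renderer/extractor pair works.
    from reportlab.pdfgen import canvas
    from pypdf import PdfReader

    original = "The job finished at noon."  # text we control, so we know the ground truth

    c = canvas.Canvas("roundtrip.pdf")
    c.drawString(72, 720, original)
    c.save()

    extracted = PdfReader("roundtrip.pdf").pages[0].extract_text()
    print(repr(original))
    print(repr(extracted))  # diff the two to see exactly what this toolchain corrupts

Because you wrote the input yourself, any difference in the output is corruption introduced by that particular toolchain, which is the "accurately model the corruption" point above.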
I'd agree that spaCy's sentence segmentation is better than many of the alternatives.
If "new and different phenomena" means new kinds of corruption and downright weird behavior, I'll end up with no hair left!
Even printing the same page to PDF with Chrome and Firefox delivers quite different results. Firefox was often combining "f" and "i" into the "ﬁ" ligature [0], which totally changed the extracted text of "finished", for example.
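If it helps anyone hitting the same thing: Unicode NFKC normalisation folds these compatibility ligatures back into their component letters, so a pass like this (a sketch, not the code from the post) cleans them up before tokenisation:

    import unicodedata

    text = "The job was \ufb01nished."  # that's the single U+FB01 "fi" ligature glyph
    clean = unicodedata.normalize("NFKC", text)  # expands it to "f" + "i" (also handles fl, ff, ...)
    print(clean)  # -> "The job was finished."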
Downloading a lot of random PDFs from arxiv would be great for making something battle-hardened and robust (and I'd love to get the chance to do it sometime) but I didn't have the time (or the remaining hair) to do it this time round.
And +1 to spaCy. I typically use it over Transformers because it's SO much faster. I just used Transformers in this example for a change. My Stack Overflow search notebook [0] uses spaCy.
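For reference, sentence splitting with spaCy is only a few lines. A sketch assuming the small English model is installed, not the notebook's exact code:

    # pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small pipeline; sentence boundaries come from the parser
    doc = nlp("PDF extraction is messy. Ligatures like \ufb01 make it worse. spaCy copes.")
    for sent in doc.sents:
        print(sent.text)

The speed difference comes from spaCy's smaller pipelines doing segmentation without a large transformer forward pass, which is why it tends to win when you only need sentence boundaries.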