A lot of full text for research (outside CS) is still locked up behind subscription paywalls. Plus, PDFs are often not the best format to extract text from.
Interesting suggestion but probably a lot of practical limitations.
That's also a blessing in disguise though: we don't want an LLM trained on closed-source stuff, so ignoring sources it can't access is probably a great thing.