> without being able to look up the original texts myself
Rule of thumb: if you can't look up the original texts, you can assume they weren't actually in the training data. The training data is, however, likely to include a lot of people quoting those texts, meaning that the model predicts "SOURCE says OPEN QUOTATION MARK" and then tries to autocomplete it. If you can verify it yourself, you might not need to ask the model at all; but if you can't verify it, it's certainly wrong.
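For anyone who wants to see that autocomplete-the-quote behaviour directly, here's a minimal sketch using GPT-2 through the Hugging Face transformers pipeline. The model and the prompt are just illustrative stand-ins, not anything specific from this thread:

```python
# Minimal sketch: a small open model will happily "autocomplete" a quotation,
# whether or not the quoted source was ever in its training data.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = 'Marcus Aurelius wrote, "'
result = generator(prompt, max_new_tokens=40, do_sample=True, temperature=0.8)
print(result[0]["generated_text"])
# The continuation reads like a plausible quotation, but nothing guarantees
# it matches anything Marcus Aurelius actually wrote.
```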
"Rule of thumb: if you can't look up the original texts, you can assume they weren't actually in the training data. "
That's not reliable. I've found them on the Internet in various forms (eg studybible.info). Google Books also has scanned copies of many ancient writings. There are probably obscure sites people would miss. When searching for them, the search algorithms might bury them in favor of newer, click-bait content.
Determining for sure what wasn't in the training data should be considered impossible right now. If it matters, we need to use models with open, legal-to-share training data. If that's impossible, one might at least use a model whose training data is accessible to them (eg free + licensed).
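As a rough illustration of what "training data accessible to them" buys you in practice, here's a sketch that streams an open corpus and checks whether an exact phrase actually occurs in it. allenai/c4 and the sample phrase are just placeholders for whatever corpus and quote you care about:

```python
# Sketch: stream an open corpus and grep for an exact phrase.
# Slow and incomplete without a proper index, but it's a direct check
# rather than a guess about what a model may have seen.
from datasets import load_dataset

phrase = "In the beginning was the Word"  # the quote you want to verify

ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(ds):
    if phrase in record["text"]:
        print(f"found in document {i}: {record['url']}")
        break
    if i >= 100_000:
        print("not found in the first 100k documents")
        break
```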