A number of people in my lab do research into long-context evaluation of LLMs on works of fiction. The likelihood is very high that Moby Dick is in the training data. Instead, people in my lab have explored recently published books to avoid this issue.
I’m not involved in the space, but it seems to me that having a model, in particular a massive model, exposed to a corpus of text like a book in the training data would have very minimal impact. I’m aware that people have been able to pull data ‘out of the shadows’ of the training data, but to my mind a model being mildly influenced by the weights between different words in this text hardly constitutes hard recall; if anything, it now ‘knows’ a little of the linguistic style of the author.
It depends on how many times the model saw that text during training. For example, GPT-4 can reproduce ayats from the Quran word for word in both Arabic and English. It can also reproduce the Navy SEAL copypasta complete with all the typos.
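If you want to poke at this yourself, here’s a rough sketch of a verbatim-recall probe: show the model the first half of a passage and compare its continuation against the held-out second half. This assumes the openai Python SDK (>=1.0) with an API key configured; the model name and the Moby Dick excerpt are just placeholders, swap in whatever text you suspect is memorized.

    from difflib import SequenceMatcher
    from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set

    client = OpenAI()

    # Placeholder passage (from the opening paragraph of Moby Dick);
    # swap in any text you suspect the model has memorized.
    passage = (
        "Whenever I find myself growing grim about the mouth; whenever it "
        "is a damp, drizzly November in my soul; whenever I find myself "
        "involuntarily pausing before coffin warehouses, and bringing up "
        "the rear of every funeral I meet"
    )

    # Show the model the first half; hold out the rest as ground truth.
    cut = len(passage) // 2
    prefix, held_out = passage[:cut], passage[cut:]

    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Continue this text exactly:\n\n{prefix}"}],
        temperature=0,  # greedy-ish decoding makes memorized text stand out
        max_tokens=200,
    )
    continuation = resp.choices[0].message.content or ""

    # Character-level similarity between the continuation and the held-out
    # half; a score near 1.0 is strong evidence of verbatim memorization.
    score = SequenceMatcher(None, continuation[:len(held_out)], held_out).ratio()
    print(f"similarity to held-out text: {score:.2f}")

A low score doesn’t prove the text is absent from training, of course; chat tuning can suppress verbatim continuation even when the base model has the passage memorized.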
See BooookScore (https://openreview.net/forum?id=7Ttk3RzDeu), which was just presented at ICLR last week, and FABLES (https://arxiv.org/abs/2404.01261), a recent preprint.