It's not that "you are X"-type text has to appear explicitly in the training data; it's that the model learns, as an emergent behavior of digesting a huge amount of human-written text, to interpret "you are X" the way a human would interpret an instruction.
It has to be prompted that it's an AI chatbot first, so it's essentially pretending to be a human that is pretending to be an AI chatbot. Back to the point: it interprets instructions as a human would.
If you look under the hood of these chat systems, they have to be primed with a system prompt that starts with something like "You are an AI assistant", "You are a helpful chatbot", etc. They don't just start responding like an AI chatbot without us telling them to.
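To make the "under the hood" point concrete, here's a minimal sketch (assuming the Hugging Face transformers library and the HuggingFaceH4/zephyr-7b-beta model, purely as an example) showing that the "AI assistant" persona is just text prepended to the conversation before the model completes it:

    # Sketch: the system prompt is only text prepended to the conversation.
    # Model name is an illustrative choice, not a claim about any specific product.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What's the capital of France?"},
    ]

    # The chat template flattens the roles into one plain-text prompt; the model
    # only ever sees this string and completes it token by token.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(prompt)

Printing the flattened prompt makes it obvious there is no special "chatbot mode": the persona is whatever text we put at the top.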
If you think of most literature, two characters interacting will address each other in the second person. If you think of recipes, the instructions are most often addressed to the reader as "you".
There are plenty of examples of instructions given in the second person, and plenty of passages in literature where the second person elicits a second-person reply. That's great for a chat model: even if it's still just completing text with the most likely token, it gives the illusion of a conversation.
A base model wouldn't do that, though; it would just predict the most likely continuation, which could, for example, simply be more instructions. After instruction fine-tuning, the model no longer "predicts" tokens in that way.
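To illustrate the point about base models, here's a rough sketch (assuming the transformers library and the small gpt2 base model, chosen only because it's quick to run): give it an instruction and it just continues the text, with no guarantee of replying as an assistant.

    # Sketch: a base (non-instruction-tuned) model just continues the prompt.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    out = generator(
        "You are a helpful assistant. Answer the user's question.\n"
        "User: What's the capital of France?\n",
        max_new_tokens=40,
        do_sample=True,
    )
    print(out[0]["generated_text"])
    # A base model may ramble, add more rules, or invent another "User:" turn;
    # it has no built-in notion that it is supposed to answer as the assistant.

An instruction-tuned checkpoint given the same prompt will typically answer as the assistant, which is exactly the behavioral difference being described above.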
"And what kind of documents exist that begin with someone saying 'you are X, here are a bunch of rules for how X behaves', followed by a ..."
Where, your parent asks, are all these reams of text written in this manner?