
Yes, but that is your parent's point:

"And what kind of documents exist that begin with someone saying 'you are X, here are a bunch of rules for how X behaves', followed by a ..."

Where, your parent asks, are all these reams of text written in this manner?




It's not that "you are X"-type text has to appear explicitly in the training data; it's that the model weights come to interpret "you are X" the way a human would interpret an instruction, as an emergent behavior of digesting a ton of human-written text.


Well, no - it's interpreting it as an instruction a chatbot AI would receive. From an almighty and omniscient 'system'.

We're training our AI on dystopian sci-fi stories about robot slaves.


It has to be prompted that it's an AI chatbot first, so it's essentially pretending to be a human that is pretending to be an AI chatbot. Back to the point: it interprets instructions as a human would.

If you look under the hood of these chat systems, they have to be primed with a system prompt that starts with something like "You are an AI assistant", "You are a helpful chatbot", etc. They don't just start responding like an AI chatbot without us telling them to.
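A minimal sketch of that priming, assuming the common role-based chat format; the template and special tokens below are illustrative, not any particular vendor's:

  # The chat frontend prepends a system message before the user's turn.
  messages = [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Why is the sky blue?"},
  ]

  # Before inference, the messages are flattened into the single token
  # sequence format the model was fine-tuned on (exact markup varies).
  def to_prompt(msgs):
      parts = [f"<|{m['role']}|>\n{m['content']}" for m in msgs]
      return "\n".join(parts) + "\n<|assistant|>\n"

  print(to_prompt(messages))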


What is the “it” that is doing the pretending?


The trained model: it takes your input, runs it through some complex math (tuned by the weights), and gives an output. Not much mystery to it.
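A rough sketch of that loop, assuming a generic causal language model from Hugging Face transformers and plain greedy decoding (GPT-2 is just a small stand-in):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  ids = tok("The model takes your input and", return_tensors="pt").input_ids
  for _ in range(20):
      logits = model(ids).logits         # the "complex math", shaped by the weights
      next_id = logits[0, -1].argmax()   # pick the most likely next token
      ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
  print(tok.decode(ids[0]))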


It doesn't seem like there's such a nefarious intent.

If you think about most literature, two characters interacting will address each other in the second person. If you think about recipes, the instructions are most often addressed to the reader as "you".

There are plenty of samples of instructions being given in the second person, and plenty of samples in literature where using the second person elicits a second-person follow-up. That's great for chat models, because even if they are still just completing text with the most likely token, it gives the illusion of a conversation.
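For instance, a plain next-token completer given dialogue-shaped, second-person text will usually keep producing dialogue-shaped text, which is where the illusion comes from. A sketch, with GPT-2 standing in for a base model:

  from transformers import pipeline

  complete = pipeline("text-generation", model="gpt2")
  prompt = ('"How do I poach an egg?"\n'
            '"First, you bring a pot of water to a gentle simmer."\n'
            '"And then?"\n')
  print(complete(prompt, max_new_tokens=40)[0]["generated_text"])
  # The continuation is typically another quoted reply, so the output reads
  # like a conversation even though the model is only completing text.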


The base model wouldn't do that, though; it would just predict the most likely follow-up, which could, e.g., simply be more instructions. After instruction fine-tuning, the model no longer "predicts" tokens in this way.
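A sketch of the base-model side of that, again with GPT-2 as a stand-in; an instruction-tuned variant would instead be trained to answer as the assistant the rules describe:

  from transformers import pipeline

  complete = pipeline("text-generation", model="gpt2")
  prompt = ("You are a customer-support agent.\n"
            "Rule 1: Always greet the customer.\n"
            "Rule 2: Never promise a refund.\n")
  print(complete(prompt, max_new_tokens=40)[0]["generated_text"])
  # A base model tends to just continue the document, often with more
  # rules, rather than start behaving as the agent being described.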


In the RLHF training sets?



