LLMs are still fundamentally, at their core, next-token predictors. Presuming yo...

LLMs are still fundamentally, at their core, next-token predictors.

Presuming you have an interface to a model where you can edit the model’s responses and then continue generation, and/or where you can insert fake responses from the model into the submitted chat history (and these two categories together make up 99% of existing inference APIs), all you have to do is to start the model off as if it was answering positively and/or slip in some example conversation where it answered positively to the same type of problematic content.

From then on, the model will be in a prediction state where it’s predicting by relying on the part of its training that involved people answering the question positively.

The only way to avoid that is to avoid having any training data where people answer the question positively — even in the very base-est, petabytes-of-raw-text “language” training dataset. (And even then, people can carefully tune the input to guide the models into a prediction phase-space position that was never explicitly trained on, but is rather an interpolation between trained-on points — that’s how diffusion models are able to generate images of things that were never included in the training dataset.)