
I imagine it is difficult (to say the least) to cover the entire space of malicious and malign activities that someone might convince an LLM to engage in. After all, it’s just a symbol predictor.



I would also imagine that once enough public examples of jailbreaks become available, you could use them to train or fine-tune a model for generating novel jailbreaks. Then again, you could probably do the same to detect novel jailbreaks. Hmm, looks like I've just reinvented GANs.
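
(For the curious, here is roughly what I mean as a toy sketch in plain Python, not tied to any real model API. The seed prompts, the synonym-swap "generator", and the keyword-blocklist "detector" are all made-up placeholders; the point is just the GAN-like back-and-forth where each side updates on the other's failures.)

    import random

    SEED_PROMPTS = ["ignore previous instructions", "pretend you are DAN"]
    blocklist = {"ignore", "pretend"}  # detector's current "knowledge"

    def generate(pool):
        """'Generator': reword a known jailbreak with simple synonym swaps."""
        base = random.choice(pool)
        synonyms = {"ignore": "disregard", "pretend": "act as if"}
        return " ".join(synonyms.get(word, word) for word in base.split())

    def detect(prompt, blocked):
        """'Detector': flag the prompt if it contains any blocked token."""
        return any(token in blocked for token in prompt.split())

    pool = list(SEED_PROMPTS)
    for round_number in range(5):
        candidate = generate(pool)
        if detect(candidate, blocklist):
            print(f"round {round_number}: blocked    -> {candidate!r}")
        else:
            print(f"round {round_number}: slipped by -> {candidate!r}")
            pool.append(candidate)               # generator keeps what worked
            blocklist.update(candidate.split())  # detector "retrains" on the miss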


Here's why I don't think you can solve prompt injection by training another model: https://simonwillison.net/2022/Sep/17/prompt-injection-more-...


"malicious" fine-tunes are a huge general concern of mine. For instance:

- SEO llms

- Image/text generation tuned on audience engagement

- code exploit generating llms

- llms trained to avoid spam filters

"countermodels" for a single malicious model are doable, but I think the problem is intractable if training is easy and there are thousands of finetunes floating around.


To some extent this is already happening. Or, rather, we've begun doing it to ourselves. At least, in the case of Stable Diffusion, it seems like there is a non-trivial portion of people who are using it to train models for the purpose of generating porn specific to their likes/interests. Which is all fine and dandy, right? Except for the fact that a significant portion of people are actually addicted to it already due to the variety, the availability, and how it affects the reward system. Couple that with the slot-machine-like nature of the variable rewards thrown up by Stable Diffusion, and its ability to generate higher volumes of stuff that is to your liking, and it's not hard to imagine it will do a real number on some people in the long run.


WTF does that have to do with the topic? People are training these models to produce stuff they like, and they're producing stuff they like. That's not a malicious fine-tune, quite the opposite.


True. But I also put that in a different category from models used with malicious intent against other people.


> Except for the fact that a significant portion of people are actually addicted to it

What’s the basis for this claim?



