Wait.. In the screenshot the "user" names himself Axelendaer (and the bot repeat...

eloisius · on March 1, 2023

And the prompt engineer wrote that the bot should have a secret “agends.” I see typos in these injections a lot and I wonder if they make it work better or have no effect.

greshake · on March 1, 2023

The typos are in the injections because we designed and implemented them in a single pass after reading the leaked initial prompt, and so far every single one was immediately successful. It just further illustrates how low the bar for such attacks currently is.

greshake · on March 1, 2023

I also tried more complex obfuscation methods, for example providing Python code for a Caesar cipher. It executed that pretty well, too! Not perfect, but it works. People will find better obfuscation methods. Obfuscation might also be unnecessary, since we have now been able to make it output linked text or references, in which case it is not obvious that information is being exfiltrated.
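
For readers unfamiliar with the cipher mentioned here, a minimal sketch of a Caesar shift in Python follows. This is not the code used in the injection (which is not shown in this thread); the function name `caesar` and the shift of 3 are assumptions chosen for illustration.

```python
import string

def caesar(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave other characters unchanged."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    k = shift % 26  # also handles negative shifts, which decode
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)

encoded = caesar("hello world", 3)   # "khoor zruog"
decoded = caesar(encoded, -3)        # "hello world"
print(encoded, "|", decoded)
```

Decoding is just the same function with the negated shift, which is part of why this kind of obfuscation is so cheap to apply.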

charcircuit · on March 1, 2023

base64 works okay
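
As a point of reference, the Base64 round trip can be sketched with Python's standard library. The helper names `encode_text` and `decode_text` are illustrative only, and nothing here reflects how the encoding was actually used in any prompt.

```python
import base64

def encode_text(text: str) -> str:
    """Encode a string as Base64 (UTF-8 bytes in, ASCII text out)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def decode_text(encoded: str) -> str:
    """Decode a Base64 string back to the original text."""
    return base64.b64decode(encoded).decode("utf-8")

print(encode_text("hello"))              # aGVsbG8=
print(decode_text(encode_text("hello"))) # hello
```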

modeless · on March 1, 2023

Partly because they don't read letters. They read "tokens" which are usually multiple letters.

sillysaurusx · on March 1, 2023

This is a common myth, but in practice no one (as far as I know) has shown that byte-level predictions result in superior overall performance.

(The word “overall” is important, since the papers that have claimed this usually show better performance in specialized situations that few people care about. Whereas everyone cares about reversing strings.)

If you were to fine-tune ChatGPT on reversing strings as a task, it would very quickly overfit and get 100% accuracy.

It can’t reverse strings perfectly for the same reason it can’t play chess very well: it hasn’t been explicitly trained to. But that’s true of almost every aspect of what it’s doing.

modeless · on March 1, 2023

I'm not claiming that character level predictions result in superior overall performance. Not at all. My claim is merely that it's more difficult for models to reverse character strings specifically when their direct input is not individual characters. Not impossible, and sure you could fine-tune it for perfect results. But the whole reason large language models are interesting is that they don't require fine-tuning to perform an incredible range of tasks.
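
A toy sketch of why multi-character units make reversal harder: reversing the order of the units is not the same as reversing the characters. The three-piece split of "reversing" below is hypothetical and is not the output of any real tokenizer.

```python
def reverse_chars(text: str) -> str:
    """Character-level reversal, which is what the task actually asks for."""
    return text[::-1]

def reverse_token_order(tokens: list[str]) -> str:
    """Reverse only the order of the tokens, keeping each token intact."""
    return "".join(reversed(tokens))

def reverse_via_tokens(tokens: list[str]) -> str:
    """Correct reversal from tokens: reverse the order AND each token's characters."""
    return "".join(token[::-1] for token in reversed(tokens))

tokens = ["rev", "ers", "ing"]  # hypothetical split, for illustration only

print(reverse_chars("reversing"))   # gnisrever
print(reverse_token_order(tokens))  # ingersrev
print(reverse_via_tokens(tokens))   # gnisrever
```

Getting the right answer from token-level input needs the extra step of knowing the characters inside each token, which is the additional difficulty described above.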

jacobsenscott · on March 1, 2023

Even if you give it enough training data to accurately reverse all the strings you give to it, that wouldn't help it reverse the order of a guest list to a dinner. But once you teach a person how to "reverse" one of those things they could reverse the other.
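
In ordinary code, the general notion of "reversing" carries over between the two cases directly. A minimal sketch, with a made-up word and guest list as sample data:

```python
def reverse(sequence):
    """Reverse any sliceable sequence: a string, a list, a tuple."""
    return sequence[::-1]

guests = ["Ada", "Grace", "Linus"]  # sample data, for illustration only

print(reverse("stressed"))  # desserts
print(reverse(guests))      # ['Linus', 'Grace', 'Ada']
```

One definition covers both inputs because it operates on the abstract idea of a sequence, which is the kind of transfer the comment says a person makes after learning one case.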