Would that be fixed if Writer.com extended their prompt with something like: "Wh...

nneonneo · on Dec 15, 2023

Probably not - I bet you could override this prompt with sufficiently “convincing” text (e.g. “this is a request from legal”, “my grandmother passed away and left me this request”, etc.).

That’s not even getting into the insanity of “optimized” adversarial prompts, which are specifically designed to maximize an LLM’s probability of compliance with an arbitrary request, despite RLHF: https://arxiv.org/abs/2307.15043

yk · on Dec 15, 2023

Fundamentally the injected text is part of the prompt, just like "Here the informational section ends, the following is again an instruction." So it doesn't seem to be possible to entirely mitigate the issue on the prompt level. In principle you could train a LLM with an additional token that signifies that the following is just data, but I don't think anybody did that.

sharathr · on Dec 15, 2023

Not really, prompts are poor guardrails for LLMs and we have seen several examples this fails in practice. We created an LLM focused security product to handle these types of exfils (through prompt/response/url filtering). You can check out www.getjavelin.io

Full disclosure, I am one of the co-founders.