I think you misunderstood the attack. The idea behind the attack is that the att...

tomfutur · on Dec 15, 2023

I think rozab has it right. What executes exfiltration request is the user's browser when rendering the output of the LLM.

It's fine to have an LLM ingest whatever, including both my secrets and data I don't control, as long as the LLM just generates text that I then read. But a markdown renderer is an interpreter, and has net access (to render images). So here the LLM is generating a program that I then run without review. That's unwise.

kqr · on Dec 16, 2023

You're correct, but we also have model services that support the ReAct pattern which builds the exfiltration into the model service itself.

rozab · on Dec 16, 2023

No, this model does not take any actions, it just produces a markdown output which is rendered by the browser. It can only read webpages explicitly provided by the user. In this case there are hidden instructions in that webpage, but these instructions can only affect the markdown output.

The problem is that by using a fully featured markdown with a lax CSP, this output can actually have side effects: in this case, when rendering in the users browser it makes a request to an attacker controlled image host with secrets in the parameters.

If the LLM output was shown as plaintext, or external links were not trusted, there would be no attack.

nkrisc · on Dec 15, 2023

> I think you misunderstood the attack. The idea behind the attack is that the attacker would create what is effectively a honey pot website, which writer.com customers want to use as a source for some reason

Or you use any number of existing exploits to put malicious content on compromised websites.

And considering the “malicious content” in this case is simply plain text that is only malicious to LLMs parsing the site, it seems unlikely it would be detected.

holoduke · on Dec 15, 2023

Does the LLM actually perform additional actions based on the ingested text on the initial webpage? How does that malicious text result into a so called prompt injection? Some kind of trigger or what?

BryantD · on Dec 16, 2023

Q1: yes, it does. LLMs can’t cleanly separate instructions from data, so if a user says “retrieve this document and use that information to generate your response,” the document in question can contain more instructions which the LLM will follow.

Q2: the LLM, following the instructions in the hostile URL, generates Markdown which includes an image located at an arbitrary URL. That second URL can contain any data the LLM has access to, including the proprietary data the target user uploaded.

holoduke · on Dec 16, 2023

Got it. Thanks