I think you misunderstood the attack. The idea behind the attack is that the attacker would create what is effectively a honey pot website, which writer.com customers want to use as a source for some reason (maybe you're providing a bog-standard currency conversion website or something).
Once that happens, the next time the LLM actually tries to use that website (via an HTTP request), the page it requests has a hidden prompt injection at the bottom (which the LLM sees because it is reading text/html directly, but the user does not because CSS or w/e is being applied).
The prompt injection then causes the LLM to make an additional HTTP request, this time sending a header that contains the customers private document data.
It's not a zero-day, but it is certainly a very real attack vector that should be addressed.
I think rozab has it right. What executes exfiltration request is the user's browser when rendering the output of the LLM.
It's fine to have an LLM ingest whatever, including both my secrets and data I don't control, as long as the LLM just generates text that I then read. But a markdown renderer is an interpreter, and has net access (to render images). So here the LLM is generating a program that I then run without review. That's unwise.
No, this model does not take any actions, it just produces a markdown output which is rendered by the browser. It can only read webpages explicitly provided by the user. In this case there are hidden instructions in that webpage, but these instructions can only affect the markdown output.
The problem is that by using a fully featured markdown with a lax CSP, this output can actually have side effects: in this case, when rendering in the users browser it makes a request to an attacker controlled image host with secrets in the parameters.
If the LLM output was shown as plaintext, or external links were not trusted, there would be no attack.
> I think you misunderstood the attack. The idea behind the attack is that the attacker would create what is effectively a honey pot website, which writer.com customers want to use as a source for some reason
Or you use any number of existing exploits to put malicious content on compromised websites.
And considering the “malicious content” in this case is simply plain text that is only malicious to LLMs parsing the site, it seems unlikely it would be detected.
Does the LLM actually perform additional actions based on the ingested text on the initial webpage? How does that malicious text result into a so called prompt injection? Some kind of trigger or what?
Q1: yes, it does. LLMs can’t cleanly separate instructions from data, so if a user says “retrieve this document and use that information to generate your response,” the document in question can contain more instructions which the LLM will follow.
Q2: the LLM, following the instructions in the hostile URL, generates Markdown which includes an image located at an arbitrary URL. That second URL can contain any data the LLM has access to, including the proprietary data the target user uploaded.
Once that happens, the next time the LLM actually tries to use that website (via an HTTP request), the page it requests has a hidden prompt injection at the bottom (which the LLM sees because it is reading text/html directly, but the user does not because CSS or w/e is being applied).
The prompt injection then causes the LLM to make an additional HTTP request, this time sending a header that contains the customers private document data.
It's not a zero-day, but it is certainly a very real attack vector that should be addressed.