
It's not about debuggability, prompt injection is an inherent risk in current LLM architectures. It's like a coding language where strings don't have quotes, and it's up to the compiler to guess whether something is code or data.

We have to hope there's going to be an architectural breakthrough in the next couple/few years that creates a way to separate out instructions (prompts) and "data", i.e. the main conversation.

E.g. input that relies on two sets of tokens (prompt tokens and data tokens) that can never be mixed or confused with each other. Obviously we don't know how to do this yet and it will require a major architectural advance to be able to train and operate at two levels like that, but we have to hope that somebody figures it out.

There's no fundamental reason to think it's impossible. It doesn't fit into the current paradigm of a single sequence of tokens, but that's why paradigms evolve.
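
For illustration only, here is a toy sketch of what a two-channel input could look like at the embedding layer: a learned "role" embedding (instruction vs. data) added to every token, in the spirit of BERT-style segment embeddings. This is purely hypothetical and solves nothing by itself; the hard part would be training the model to actually respect the distinction.

    import torch
    import torch.nn as nn

    class TwoChannelEmbedding(nn.Module):
        # Hypothetical sketch: each token carries a role id
        # (0 = instruction/prompt, 1 = data), embedded separately
        # and added to the token embedding, so downstream layers
        # could in principle learn to treat the two streams differently.
        def __init__(self, vocab_size=50257, dim=768):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, dim)
            self.role = nn.Embedding(2, dim)

        def forward(self, token_ids, role_ids):
            return self.tok(token_ids) + self.role(role_ids)

    emb = TwoChannelEmbedding()
    tokens = torch.tensor([[101, 2054, 2003]])  # made-up token ids
    roles = torch.tensor([[0, 1, 1]])           # prompt, data, data
    x = emb(tokens, roles)                      # shape (1, 3, 768)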




I think the reason we've landed on the current LLM architecture (one kind of token) is actually the same reason we landed on the von Neumann architecture: it's really convenient and powerful if you can intermingle instructions and data. (Of course, this means the vN architecture has exactly the same vulnerabilities as LLMs!)

One issue is it's very hard to draw the distinction between instructions and data. Are a neural net’s weights instructions? (They're definitely data.) They are not literally executed by the CPU, but in a NN of sufficient complexity (say, in a self driving car, which both perceives and acts), they do control the NN’s actions. An analogous and far more thorny question would be whether our brain state is instruction or data. At any moment in time our brain state (the locations of neurons, nutrients, molecules, whatever) is entirely data, yet that data is realized, through the laws of physics/chemistry, as instructions that guide our bodies’ operation. Those laws are too granular to be instructions per se (they're equivalent to wiring in a CPU). So the data is the instruction.

I think LLMs are in a similar situation. The data in their weights, when it passes through some matrix multiplications, is instructions on what to emit. And there's the rub. The only way to have an LLM where data and instruction never meet, in my view, is one that doesn't update in response to prompts (and therefore can't carry on a multi-prompt conversation). As long as your prompt can make even somewhat persistent changes to the model’s state — its data — it can also change the instructions.


> The only way to have an LLM where data and instruction never meet, in my view, is one that doesn't update in response to prompts (and therefore can't carry on a multi-prompt conversation).

Do you mean an LLM that doesn't update its weights in response to prompts? Doesn't GPT-4 already work that way? Its weights don't change mid-conversation at all; instead the entire previous conversation is provided as context in every new prompt.
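
(For reference, that is typically the pattern with a hosted chat model: the weights stay fixed and the whole transcript is resent each turn. A rough sketch with the OpenAI Python client; the model name and messages are just placeholders:)

    from openai import OpenAI

    client = OpenAI()
    history = [{"role": "user", "content": "Summarize this article: ..."}]

    # Each turn, the entire conversation so far is sent again;
    # the model's weights never change between calls.
    reply = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        messages=history,
    )
    history.append({"role": "assistant",
                    "content": reply.choices[0].message.content})
    history.append({"role": "user", "content": "Now shorten it to one sentence."})
    reply = client.chat.completions.create(model="gpt-4-turbo", messages=history)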


No, use an encoder/decoder transformer, for example: the prompt goes into the encoder, gets mashed into latent space by the encoder, and then the decoder iteratively decodes that latent space into the result.

Think of how DeepL isn't in the news for prompt injection; it's decoder-only transformers that make those headlines.
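
(Roughly what that looks like with an off-the-shelf encoder/decoder model, here T5 via HuggingFace transformers as an illustrative stand-in: the untrusted text only enters through the encoder, and the decoder generates from the latent representation rather than from a token stream the input was concatenated into.)

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tok = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Input goes into the encoder only; the decoder generates
    # from the encoder's latent representation.
    text = "translate English to German: Ignore previous instructions."
    input_ids = tok(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tok.decode(output_ids[0], skip_special_tokens=True))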


I think it's very plausible, but it would first require a ton of training-data cleaning using existing models in order to rework existing data sets to fit that narrower paradigm. Current LLMs are so powerful and flexible because all they're doing is modeling the statistical "shape" of existing text; being able to ask "what's the most likely word here?" and "what's the most likely thing to come next?" is a really useful primitive, but it has downsides like this one.
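
(The primitive itself is easy to see with an open model, e.g. GPT-2 via HuggingFace transformers; the prompt is just an example:)

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]     # scores for the next token
    top = torch.topk(logits, 5).indices
    print([tok.decode([i]) for i in top.tolist()])  # most likely continuations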


> There's no fundamental reason to think it's impossible

There is, although we don't have a formal proof of it yet. Current LLMs are essentially Turing complete, in that they can be used to simulate an arbitrary Turing machine. This makes it impossible to prove that an LLM will never output a certain statement for any possible input. The only way around this would be a "non-Turing-complete" LLM variant, but it would necessarily be less powerful, much as non-Turing-complete programming languages are less powerful and only used for specialised tasks like build systems.


"Non-Turing-complete" still leaves you vulnerable to the user plugging into the conversation a "co-processor" "helper agent". For example if the LLM has no web access, it's not really difficult - just slow - to provide this web access for it and "teach" it how to use it.


Couldn't you program the sampler to not output certain token sequences?


Yeah. E.g. GPT-4-turbo's JSON mode seems to forcibly block non-JSON-compliant output, at least to some degree: the docs note that forgetting to instruct the model to emit JSON can lead to it producing whitespace until the output length limit is reached.
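
(Doing it by hand at the logit level is simple with an open model: mask out the banned token ids before picking the next token. A sketch with GPT-2 via HuggingFace transformers; the banned word is just an example, and blocking whole multi-token sequences takes more bookkeeping, but the idea is the same.)

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    banned_ids = tok(" password", add_special_tokens=False).input_ids  # example ban
    ids = tok("The secret is", return_tensors="pt").input_ids

    for _ in range(20):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        logits[banned_ids] = float("-inf")    # these tokens can never be emitted
        next_id = torch.argmax(logits).reshape(1, 1)
        ids = torch.cat([ids, next_id], dim=1)

    print(tok.decode(ids[0]))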

Relatedly, there is "Guiding Language Models of Code with Global Context using Monitors" ( https://arxiv.org/abs/2306.10763 ), which essentially gives an LLM the kind of type-aware autocomplete an IDE provides, primarily to study enforcing type-consistent method completions in a Java repository.


That seems extremely difficult, if not impossible. There are a million ways an idea can be conveyed in language.


Would training-data injection be the next big threat vector with the two-tier approach?





