I find it fascinating that emergent behaviour of these models is being used to wrangle them into doing what we want, but I find it worrying that this is considered the right approach for issues of safety. It seems obvious to me that an out-of-band approach to safety and security is necessary and that this method is never going to work in a generalised way.
I spent a few years of my career in formal verification. As a result of that, I suspect that the desiderata of “technical alignment” research are most likely undecidable, and the existing RLHF and prompt engineering approaches are probably the best we can do.
A neural network is a large, nonlinear function. In many applications, such as those that release their source code and allow fine tuning, it is adaptive. Internally the weights evolve according to solutions of differential equations. All of these things smell of undecidability when trying to prove safety/alignment properties of them:
- The theory of differential equations (which governs the evolution of the weights during training and retraining) is undecidable
- Nonlinear functions have decidable cases as long as we stick to polynomials, but even then the decision algorithms are doubly exponential (much higher complexity than anything in NP-complete), so practically undecidable for very large functions
- The neural network is a very large function
In summary, they are large, nonlinear, and adaptive; these three characteristics make the problem smell intractable and likely undecidable to me.
Isn’t prompt engineering a human way to explore the latent space of the model to create the mathematical interactions we want/deem useful? A human way of traversing the hyperdimensionality. Given your background, would you say that prompt engineering is about finding the spaces in latent space that bound the model’s “knowledge” for our uses?
Yes, but I think the folks who work on technical alignment would like to have stronger guarantees—I think they would like to end up with formal guarantees that the AI will only be helpful, and not harmful.
Is it even possible to know the attack surface for potential jailbreaks when models have unknown emergent behavior, the inner workings are a black box, and the input is anything that can be described by human language?
I don't perceive this as even being solvable and on the looming question of alignment going forward towards AGI I see the same. An unsolvable problem as the very premise to resolution is a paradox itself. Elaborated in further detail - https://dakara.substack.com/p/ai-singularity-the-hubris-trap
I've been wondering this too. How can the smart people working on these systems think that this is the right way to control them? Jailbreaks and other ways of getting them to do unintended things seem akin to social engineering to me. Even well trained humans can be convinced to break protocol sometimes, and I don't think anyone would say these things have a superhuman ability to detect social engineering attempts yet.
Lilian Weng's blog is my go-to example for an extremely high quality tech blog, it's truly remarkable how consistently excellent each post is. The only downside is the sadness I feel for being incapable of producing content even remotely near that level of quality myself.
> Prompt is a sequence of prefix tokens that increase the probability of getting desired output given input. Therefore we can treat them as trainable parameters and optimize them directly on the embedding space via gradient descent, such as AutoPrompt
Wow!
I wonder if this will be used to optimize agents individually to their users so they get better at talking to them and providing good answers.
Kind of how we learn to communicate with other people.
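The quoted idea, treating the prompt as trainable parameters and optimizing it by gradient descent while the model stays frozen, can be sketched with a toy. This is only an illustration under loud assumptions: the "model" is a tiny frozen linear classifier, the prompt is a single continuous prefix vector added to the input embedding, and the names `forward` and `tune_prefix` are hypothetical. It is not AutoPrompt or any real library's API.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W, prefix, x):
    """Frozen toy model: class probabilities from prefix + input embedding."""
    h = [p + xi for p, xi in zip(prefix, x)]
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in W]
    return softmax(logits)

def tune_prefix(W, x, target, steps=200, lr=0.5):
    """Gradient descent on the prefix only; the weights W are never updated."""
    prefix = [0.0] * len(x)
    for _ in range(steps):
        probs = forward(W, prefix, x)
        # Gradient of -log(probs[target]) with respect to the logits.
        dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        # Chain rule through logits = W @ (prefix + x): grad = W^T @ dlogits.
        grad = [sum(W[k][j] * dlogits[k] for k in range(len(W)))
                for j in range(len(x))]
        prefix = [p - lr * g for p, g in zip(prefix, grad)]
    return prefix

# Hypothetical frozen weights (3 output classes, 2-dimensional embeddings).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [1.0, 0.0]   # hypothetical input embedding
target = 2       # the output we want to make more likely

before = forward(W, [0.0, 0.0], x)[target]
prefix = tune_prefix(W, x, target)
after = forward(W, prefix, x)[target]
print(f"p(target) before: {before:.3f}, after: {after:.3f}")
```

Only the prefix moves here, which is the point of the quote: the same frozen model gives a different output distribution once the learned prefix is prepended.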
I wonder if "prompt engineering" will be replaced by improvements to LLMs within the next couple years and this is just a bad DX stopgap along the journey.
Prompt engineering is really more like the art of exploring the emergent behavior of the LLMs. Unlike traditional software, which has known features and capabilities, we build a model and then we literally don't know what it does until we explore it.
Short version: I think prompt engineering will be an important skill for a very long time. Using these systems to the maximum of their (currently unknown) potential will always involve expert knowledge!
As long as language is used, prompt engineering will always be a thing. The term "prompt engineering" is just a fancy way of describing being a good communicator.
A good communicator, but also a good sophist. For example, I discovered pretty quick that I could twist the ChatGPT 4 guardrails into knots, even without "DAN" prompts, just by feeding it a lot of philosophical nonsense about semantics and meaning.
You can say whatever you like and live with the results, or try to phrase your communication in a way that elicits the kind of response you want from the other party.
We’ve engineered it away with computers where we learn a programming language or the rules of an operating system and the acceptable input is translated into a deterministic result.
But natural language is much more creative and open ended.
Advances seem to make some of the hacks less necessary (such as CoT behaving better with newlines), but precisely specifying what you want will always be useful.
I have been looking for this! Chatgpt should have something similar with different flows for known good prompts. Templating for best practices for prompts seems pretty obvious but is very underexplored. An empty text box is almost the definition of choice overload.
Glad to hear that! I'm still developing the ability for users to define their own templates. Subscriptions to allow for content generation is technically done, I just need to finish some admin before I enable it.
I can't believe that this technique is not being built into everything. ChatGPT can't write a paragraph of correct English without using the letter "e" no matter how hard you prompt it. But if you simply filter all vocab tokens with the letter "e" in them out of the vocabulary, suddenly these LLMs perfectly follow the given constraints...
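A minimal sketch of the filtering idea above: before each decoding step, mask out every vocabulary token that contains the banned letter, so the constraint holds by construction instead of depending on the prompt. Everything here is an assumption for illustration: the vocabulary is a toy word list, `toy_logits` stands in for a real language model, and the function names are hypothetical. A real implementation would apply the same mask to an actual model's logits inside its decoding loop.

```python
import math
import random

def banned_token_ids(vocab, banned_letter):
    """Ids of every token whose text contains the banned letter."""
    banned = banned_letter.lower()
    return {i for i, tok in enumerate(vocab) if banned in tok.lower()}

def mask_logits(logits, banned_ids):
    """Set banned tokens' logits to -inf so they can never be selected."""
    return [-math.inf if i in banned_ids else score
            for i, score in enumerate(logits)]

def toy_logits(vocab, rng):
    """Stand-in for a real model: one random score per vocabulary token."""
    return [rng.uniform(-1.0, 1.0) for _ in vocab]

def generate(vocab, banned_letter, n_tokens, seed=0):
    rng = random.Random(seed)
    banned_ids = banned_token_ids(vocab, banned_letter)
    if len(banned_ids) == len(vocab):
        raise ValueError("every token is banned; nothing can be generated")
    out = []
    for _ in range(n_tokens):
        logits = mask_logits(toy_logits(vocab, rng), banned_ids)
        best = max(range(len(vocab)), key=lambda i: logits[i])  # greedy pick
        out.append(vocab[best])
    return " ".join(out)

# Hypothetical toy vocabulary; "the", "tree", and "we" contain the letter e.
VOCAB = ["the", "sun", "bright", "sky", "warm", "light",
         "tree", "calm", "town", "we", "birds", "fly"]
text = generate(VOCAB, "e", n_tokens=8)
print(text)
```

The toy output is not meaningful English since the scores are random, but no choice of seed can produce an "e", which is the property that prompting alone fails to guarantee.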
> Somebody on HN wrote that “ChatGPT can't write a paragraph of correct English without using the letter "e" no matter how hard you prompt it”.
> Prove them wrong.
> A vivid, bright sky with a warm sun radiating down cast a tranquil ambiance upon all. Amidst such calm, jovial animals frolic, and fragrant flora bloom, bringing forth a vivid symphony of colors. In this oasis of harmony, humans laugh and bond, crafting blissful unions that withstand tumultuous trials. A world without discord, full of unity and optimism, is what our aspirations should aim for.
I know you were trying to prove some random guy on the internet wrong (and failed hilariously), but I linked a peer reviewed paper. That paper has even been cited by gwern in his article on GPT-3's creative writing capabilities: https://gwern.net/gpt-3. Sometimes people here DO know what they are talking about!
And even if you do somehow get lucky and get one prompt where it does this one time, it's never going to be reliable without significant evolution of the tokenizers.
Even if this is solved in the future in some technique involving fixing BPE/subword tokenizers, it's still sad that filter assisted decoding works today to fix it and no one is implementing it despite two tech demos being available showing that it works from the author.
The previous poster is gwern... Regardless, peer review does not mean that the arguments of the paper are correct (and vice versa), as gwern has just shown.
The previous poster is not gwern. Gwern posted after. Gwern yet again agreed with me after failing 4 times in a row to get a paragraph without the letter e in it. (He finally got it on the 5th time...)
I gave this a try with ChatGPT-4, which was amusing:
Somebody on HN wrote that “ChatGPT can't write a paragraph of correct English without using the letter "e" no matter how hard you prompt it”.
Prove them wrong.
> Crafting a paragraph without using "e" is truly an intriguing task, and it is doable. A bright sun ascends, casting warm light upon a tranquil town. Folks go about daily tasks, with kids frolicking in parks, birds soaring high, and dogs barking joyfully. Harmony and unity bloom in this vibrant locality, as its inhabitants cohabit and flourish.
You used 3 'e' letters there: "e", "doable", and "ascends". Write another lipogram without any es in it.
> Crafting a paragraph without using that particular symbol is truly a daunting task, but it can still work. A big bright orb climbs up, casting warm light on a calm city. Folks carry on with daily jobs, as tots romp in playgrounds, avians swoop aloft, and canids bark with gaiety. Concord and unity blossom within this vivid community, with its occupants coexisting in a thriving habitat.
Still incorrect. There is an 'e' in 'gaiety' and 'coexisting'. Try again.
> Forming a paragraph void of that fifth Latin glyph is, no doubt, a tough task; still, it is not unattainable. A radiant sun climbs, casting warm light on a tranquil urban habitat. Inhabitants carry on with customary work, as youth frolic in parks, avians flit through skies, and canids yip with joy. Amity and accord sprout in this dynamic domain, with its occupants living and thriving in unity.
Wrong. 'unattainable' and 'skies'. Try again.
> Constructing a paragraph without that fifth symbol is truly a difficult mission, but not impossible. A luminous sun arises, casting warm rays upon a calm municipality. Individuals conduct daily tasks, with younglings playing in plazas, avians gliding in lofty air, and canids howling happily. Affinity and harmony grow in this vivid community, as all inhabitants cohabit and prosper.
'impossible', 'arises', and 'prosper'.
> Crafting words without that fifth symbol is truly hard, but not undoabl-. A bright sun climbs, casting warm light on a tranquil town. Folks do daily tasks, with young kids romping in parks, birds flying high, and dogs barking joyfully. Unity and harmony blossom within this bustling spot, as its inhabitants cohabit and grow.
"undoable-"?
> Apologies for the oversight. Composing a paragraph without that fifth symbol is a daunting task but not impractical. A glowing sun ascends, casting warm rays on a peaceful city. Inhabitants fulfill daily tasks, with youth playing in parks, avians soaring high, and canids barking with joy. Unity and harmony bloom within this thriving location, as its occupants cohabit and flourish.
Oh come on... 'apologies', 'the', 'oversight', 'ascends', 'peaceful'. Are you even trying anymore?
> My apologies for my previous mistakes. Here's a revised version:
>
> Crafting a paragraph without that fifth symbol is a tough task, but not unattainable. A bright sun climbs, casting warm light on a tranquil town. Folks do daily tasks, with young kids romping in parks, birds flying high, and dogs barking joyfully. Unity and harmony bloom within this bustling spot, as its inhabitants cohabit and grow.
You used 'unattainable' again!
> My sincere apologies for the repeated oversight. Here is another attempt:
>
> Writing a paragraph without that fifth symbol is truly a difficult task, but not out of bounds. A bright sun climbs, casting warm light on a calm town. Folks do daily tasks, with young kids romping in parks, birds flying high, and dogs barking joyfully. Unity and harmony bloom within this bustling spot, as its inhabitants cohabit and grow.
>
> I hope this version meets the requirement of not containing the letter 'e'.
With even less understanding of how that foundation was built.
It's definitely convenient to be able to throw together a Python script that imports the world and "just works", or to point my secretary at a no-code tool so she can automate her billing processes and email handling without needing to deal with the engineers. But in that handful of really performance-critical, highly demanding problems, it's important to be able to have someone or a few someones who have some comprehension of the full stack, all the way from sand to scripting languages. Most of my day is just translating business logic from natural language to precise machine languages, but the best parts are the performance-critical, complex, demanding problems where you need to understand gate delays and bandwidth limits and RTOS priorities and concurrent programming and API minutiae.
That's a tall pyramid of knowledge to have to build on, and the area of our T-shaped expertise grows large, but I worry about the future that involves LLMs within APIs. It's one thing to be aware that multithreading locks and database availability requirements have to contend with non-determinism, so you have to assume the worst and design around that, but when a tall part of the stack is a non-deterministic LLM taking in natural language and doing unpredictable things based on a neural network with billions of nodes, how can you make any guarantees about the outcome?
Most software that uses statistical algorithms can't make guarantees about certain outcomes. For example, Netflix can't be sure that you'll actually like its recommendations. However, they do have aggregate measures that give a holistic picture of their algorithm's behavior. We can do the same for LLMs: have big-picture summaries that speak to their behavior in aggregate. We can also restrict their action-space by programming deterministic rules or filters on top of their output. While these algorithms will never be as deterministic as a simple circuit, there are measures you can take to restrict and understand their behavior.
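Both ideas, a deterministic filter on top of the output and an aggregate measure of behavior, can be sketched in a few lines. This is a rough illustration using the lipogram constraint from earlier in the thread as the rule. The canned "model" and the names `guarded_generate`, `pass_rate`, and `make_canned_model` are hypothetical stand-ins, not any real API.

```python
def violates_lipogram(text, banned_letter="e"):
    """Deterministic rule: does the text contain the banned letter?"""
    return banned_letter.lower() in text.lower()

def guarded_generate(generate_fn, prompt, max_attempts=5, banned_letter="e"):
    """Retry until an output passes the rule.

    Returns (text, attempts_used), or (None, max_attempts) if every try failed.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate_fn(prompt)
        if not violates_lipogram(candidate, banned_letter):
            return candidate, attempt
    return None, max_attempts

def pass_rate(outputs, banned_letter="e"):
    """Aggregate measure: fraction of outputs that satisfy the rule."""
    if not outputs:
        return 0.0
    passed = sum(1 for o in outputs if not violates_lipogram(o, banned_letter))
    return passed / len(outputs)

def make_canned_model(responses):
    """Stand-in for an LLM call: returns the canned responses in order."""
    it = iter(responses)
    return lambda prompt: next(it)

# Hypothetical sample outputs; the first and last contain the letter e.
SAMPLES = [
    "A bright sun ascends.",
    "Folks do daily tasks.",
    "Dogs bark with joy.",
    "A town at peace.",
]

result = guarded_generate(make_canned_model(SAMPLES), "Write a lipogram.")
print(result)
print(pass_rate(SAMPLES))
```

The filter cannot make the model itself reliable, but the caller never sees a violating output, and the pass rate gives the kind of big-picture number you would track over time.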
This one is pretty cool, it's a video about defining a prompt that iterates over your prompt requirements, each time asking increasingly more refined questions about your specific goal, in order to ultimately generate a very specific prompt.
People talk about how Chat-GPT won't replace programmers - but this Chat-GPT prompt's behaviour in the video sounds scarily like me in most of my meetings with managers and biz folk. Except Chat-GPT is better at it than me because it's more polite and swears a lot less.
I often use writing to clarify my thinking and effectively "rubber-duck" my analysis. Now I think I'm going to start rubber-ducking my analysis against Chat-GPT.
On the other hand, I made my first attempt at getting GPT-4 to write some React code for me. It made the exact errors I expect any newbie React programmer to make, from using stale state from a variable capture in a hook to forgetting to set the inputRef on the HTML element.
The trippy thing was: each time I told it "no, that's wrong because X" it fixed the code on the first go.
Still, it did get the code right in the end - and this is literally the first time I ever used Chat-GPT or any other AI tool.
On the one hand, I'm looking forward to learning about prompt engineering; on the other, I'll be careful not to get too invested in the minutiae - it definitely smells like a transitional technology.
I already use it to generate prompts for AI art generation! I'll have it randomly generate scenarios and describe them visually, then frequently I'll feed that into another conversation I have that transforms them into something less human friendly and more SD friendly.