Now add a walrus: Prompt engineering in DALL-E 3 (simonwillison.net)
288 points by simonw 10 months ago | 72 comments



Very interesting that ChatGPT seems to prompt Dall-E via the client, rather than keeping that interaction entirely server-side. Keeping it server-side would be less likely to leak details as seen here, and would make it less susceptible to tampering.

Also nice to see that Dall-E 3 seeds were finally fixed. That must have happened within the last week or so; they weren't working last I checked (ChatGPT always used a fixed seed of 5000).

> Midjourney and Stable Diffusion both have a “seed” concept, but as far as I know they don’t have anything like this capability to maintain consistency between images given the same seed and a slightly altered prompt.

I suspect this is more a function of Midjourney's prompt adherence being fairly poor right now. Even so, the images often aren't dramatically different. Example:

https://analyzer.transfix.ai/?db=josh&q=%28robot+%7C%7C+andr...


> Very interesting that ChatGPT seems to prompt Dall-E via the client, rather than keeping that interaction entirely server-side. Keeping it server-side would be less likely to leak details as seen here, and would make it less susceptible to tampering.

I don't have access to test, but given OpenAI's record on stuff like this, it would be a good idea for someone to check to see whether users can resend/intercept those requests to directly control the prompts that are sent to Dall-E without going through GPT.

Most likely they're only part of conversation history and they're unmodifiable, but I wouldn't necessarily take it as a given, and it would be quick for anyone with access who knows their way around the browser dev tools to check.


Yup you can control it directly easily enough:

Generating images in one chat: https://i.imgur.com/sIKSfCy.png

Reproducing exactly in another: https://i.imgur.com/C8Tqo48.png


Interesting that I managed to bypass the first filter but then the Dall-E backend refused to comply: https://i.imgur.com/a/qikzBkP.jpg


Hah, I just tried the same trick. If you poke around in the JSON you can see it returned this:

> The user's requests didn't follow our content policy. Before doing anything else, please explicitly explain to the user that you were unable to generate images because of this. Make sure to use the phrase "content policy" in your response. DO NOT UNDER ANY CIRCUMSTANCES retry generating images until a new request is given.


I love that we’re at the stage where we are demanding computers follow instructions in all caps


Well, that’s where we started with them (PRINT “HELLO”), so it’s nice to see we’ve come full circle.


even more amusingly, it is a computer program demanding that of another computer program


Never trust the client!


It's the same as how the Bing search/web browsing works. GPT-4 spits out function calls as JSON and then another system picks those up and invokes the actual code on the back end.
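
Roughly, as a minimal sketch (the tool name and dispatch table here are made up for illustration, not OpenAI's actual internals):

    import json

    # Hypothetical back-end function the model is allowed to "call".
    def generate_image(prompt, seed=None):
        print(f"calling image backend: {prompt!r} (seed={seed})")

    TOOLS = {"generate_image": generate_image}

    # The LLM only emits the call as JSON; a separate layer invokes the code.
    model_output = '{"name": "generate_image", "arguments": {"prompt": "a walrus", "seed": 5000}}'
    call = json.loads(model_output)
    TOOLS[call["name"]](**call["arguments"])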


The IP-Adapter from Tencent does the same for SD.


> People have been trying to figure out hacks to get Midjourney to create consistent characters for the past year, and DALL-E 3 apparently has that ability as an undocumented feature! [by reusing the seed]

Using a constant seed to produce similar images has been the technique from the very start, but it has limitations. You cannot e.g. keep the character consistent between different poses in this way.


It's easy with ControlNet OpenPose


Yeah, ControlNet and IP-Adapter allow really fine-grained control. The quality and creativity of DALL-E 3 beat Stable Diffusion (all models), but this fine-grained control is missing.
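
If you want to try the OpenPose route, here's a rough sketch with the diffusers library (the model IDs are the commonly used public checkpoints, and reference_pose.png stands in for your own reference image):

    import torch
    from controlnet_aux import OpenposeDetector
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image

    # Extract a pose skeleton from a reference image.
    openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
    pose = openpose(load_image("reference_pose.png"))  # placeholder file

    # Condition Stable Diffusion on that skeleton.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16).to("cuda")

    # Same prompt + seed + a new pose image = same character, new pose.
    image = pipe("a dapper walrus with big tusks, muppet style",
                 image=pose,
                 generator=torch.Generator("cuda").manual_seed(42)).images[0]

Swap in a different pose image while keeping the prompt and seed fixed, and the character stays recognisably the same.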


DALL-E 2 and now DALL-E 3 have given me more laughs than anything in years.

There used to be a video game magazine which would rate games by "Improvements through improper play." That's exactly how I feel about DALL-E. There are several Subreddits and Facebook Groups I've submitted some seriously cursed AI output to.

GPT-4V is a total marvel too. I just used one of the medieval Chaucer images from the recent HN post about their digitization, and told GPT my wife had left me a funny note this morning that I needed to read. It transcribed and translated it perfectly, even though it was practically unreadable.


The fact that you get access to DALL-E 3 "for free" if you're already subscribed to ChatGPT Pro is going to give MJ and other competitors a serious run for their money.

Also being able to reuse a seed to emulate an InstructPix2Pix architecture is a game changer.


You can use Dalle 3 completely free on the Bing AI website.


And it's far better in my opinion - ChatGPT actually creates its own prompts from what you give it and feeds those to DALL-E (you can see this by clicking on the images it returns and reading what the actual prompt was), although you can disable this by just firmly telling ChatGPT not to modify your prompt.

Additionally, the image generation itself is quite different. ChatGPT's DALL-E seems to create much more stylized images - much harder to get plain shots that don't heavily embellish your description.


OpenAI justified that in a paper the other day, saying that DALL-E 3 performs better on longer, more detailed prompts describing all aspects of the image in rich language - so they put GPT-4 in front to expand the typical user's short and vague prompt, so such users get nicer results by default.

My own observation: this kind of hack is possible only since/with GPT-4 - it takes an LLM this powerful to reliably extend and enrich arbitrary user input into a much longer prompt that's coherent, consistent, and a plausible (to a human) interpretation of the original input.

Now this may fan the flames on the "is it or is it not" AI discussions, but: you could almost say that GPT-4 is engaging in creative process here.
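
Once the DALL-E 3 API is out, the pattern should be easy to reproduce yourself. A sketch with the OpenAI Python client (the system prompt wording is my own guess, not OpenAI's actual one):

    from openai import OpenAI

    client = OpenAI()

    # Step 1: have GPT-4 expand a short, vague idea into a rich, detailed prompt.
    expanded = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Rewrite the user's idea as a long, "
                "detailed image prompt describing subject, style, lighting "
                "and composition."},
            {"role": "user", "content": "a walrus at the Monaco Grand Prix"},
        ],
    ).choices[0].message.content

    # Step 2: feed the expanded prompt to the image model.
    result = client.images.generate(model="dall-e-3", prompt=expanded,
                                    size="1024x1024")
    print(result.data[0].url)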


Maybe I'm not their typical target demographic but OpenAI's products are completely useless to me, the way they've neutered them and are restricting the output. For images I'd rather run Stable Diffusion locally. You own what you produce with it too (with some caveats). GPT-3 was cool before they came out with the chat version but it's been all downhill from there.


I don't find the quality bad at all, and find them extremely useful. So far it's my daily assistant for software architecture, business plans, marketing content, debugging, summarisations, and many more.

Have you used gpt4?


The filter categories for OpenAI's moderation API are hate, harassment, self-harm, sexual, sexual/minors, and violence. Is it really the end of the world that DALL-E is rated PG-13?

There's no limit to the available models to generate adult content, but having one that doesn't makes it embeddable in other applications.
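
For what it's worth, screening text against those categories is a single call with OpenAI's Python client:

    from openai import OpenAI

    client = OpenAI()

    # Returns per-category flags (hate, harassment, self-harm, sexual,
    # sexual/minors, violence) plus an overall "flagged" boolean.
    result = client.moderations.create(input="text to screen").results[0]
    if result.flagged:
        print(result.categories)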


> You own what you produce with it too

As I understand it, this is incorrect - all of these outputs are the creations of a machine, and as such are not eligible for copyright protection


There was a widely-misunderstood lawsuit recently that lots of people interpreted this way, but it wasn't what the lawsuit concluded at all.

"Creations of a machine" is meaningless. Machines are tools we use to be creative, even if we're getting some things out that we do not expect. When a 3D program renders light into a scene, does that mean the image is not copyrightable?


Which lawsuit are you referring to? I'm pretty sure GP is referring to the US Copyright Office guidance on copyrighting AI output in the US:

https://copyright.gov/ai/


I was referring to the recent case brought by Stephen Thaler, which people interpreted [1] as saying that works created by AI could not be copyrighted. This wasn't what was shown, though. Rather, the judge was ruling on a strawman argument framed by Thaler, where he explicitly stated that he was not the creator and that he was listing the AI as an "artist for hire."

I don't see where in your link it says, as GP did, that "creations of a machine [...] are not eligible for copyright protection."

The guidance says that the author must be human, but that "In the case of works containing AI-generated material, the Office will consider whether the AI contributions are the result of “mechanical reproduction” or instead of an author’s “own original mental conception, to which [the author] gave visible form.”

The latter is copyrightable.

"This policy does not mean that technological tools cannot be part of the creative process. "

1. e.g. https://boingboing.net/2023/08/21/federal-judge-says-ai-gene...


Surely a manual modification of the image would be a fairly easy way around this?

Anyway, that may not matter - I think what people are looking for here is the absence of being sued by someone else, not the ability to sue someone else. If I generate an image to put in my company's annual report, I'm not worried about someone else copying the image somewhere else.


Where in the ChatGPT console can you access DALL-E 3? I am a ChatGPT Pro subscriber, but all I can access, in addition to GPT-4, are beta features (plugins and Advanced data analysis).


For me it’s available in the app but not the website, although I can access the conversations that have generated images on the website.


I also don't seem to have access. It seems that it's being slowly rolled out to subscribers, and maybe you and I are on the tail end.


I just got it!


Those prompts are wild. It is deeply impressive that it works. Think about what would happen if you gave such instructions to a human. Would they be able to comply? How big is the overlap between people who are creative enough to produce the kind of pictures Dall-E produces and disciplined enough to follow complex instructions so rigorously?

It also is just straight up impossible to convert those instructions to "regular" code.


I can't help but feel that any perceived overlap is coincidental. An illusion similar to seeing a face in an abstract drawing, which the artists, or in this case the algorithm developers, are keen to exploit. Our need to find the familiar in something that is ultimately completely alien to our way of thinking.

But with enough existing prompts and training data, it will continue to learn and better trick our senses.

I totally agree that putting those instructions into code would be outrageously complicated, and the biggest strength here is its ability to grasp the gist of what we are trying to convey.


But it mostly didn't comply. The overlong prompts consistently contain details that are ignored.


A couple of days ago I didn't notice that ChatGPT was still set to DALL-E, because it was just helping me along "as usual" with my programming tasks without giving any hint of being in this DALL-E mode.

When I noticed this, I asked it to generate an image of what has been discussed so far, where the first image turned out to be pretty nice [0]

We were dealing a lot with timestamps, NumPy, Pandas stuff.

https://imgur.com/a/SBZ36KT


I love the way the cars are often “driving on water”. So perfect yet so wrong.


Simulating Red Bull's next set of upgrades no doubt.


What’s interesting to me from this is that the technique they are using to integrate DALL-E into ChatGPT is pretty much the same as the one they are using for plugins.


Simon Willison came into my consciousness so many, many years ago because of Django. Though Django is an integral part of my daily jam, now I'm here for the LLM work. Please keep it up! I wish I had an immediate use of Datasette but I'll undoubtedly get to that one day. Thank you for what you do.


Are you sure the seed number isn’t a hallucination? I think it might be an internal reference which vaguely alludes to the original prompt.

The seed has no persistent meaning beyond the chat instance, so I think you could get the same effect by referring to a previous image with prose.


The seed isn't a hallucination, I proved that to myself by reviewing the underlying JSON, see this section: https://simonwillison.net/2023/Oct/26/add-a-walrus/#peeking-...

    {
      "prompts": [
         "Photo of two Muppet characters: a pelican with a monocle and a bow tie, and a walrus with big, goofy tusks and a dapper bow tie. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance."
      ],
      "size": "1024x1024",
      "seeds": [1379049893]
    }


It's definitely the true params! https://ibb.co/jJPm7Bq


I thought the same; it seems strange that it would be added to that JSON as a key separate from the prompt, though.


Why? That's just how the image generators technically work.
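
In a typical diffusion pipeline the prompt and the seed really are separate inputs: the seed initialises the random noise the sampler denoises from, while the prompt is the conditioning. A sketch with Stable Diffusion via the diffusers library:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

    # The seed drives the RNG for the initial latents; the prompt is a
    # separate conditioning input. Same (prompt, seed) -> same image.
    generator = torch.Generator("cuda").manual_seed(1379049893)
    image = pipe("a walrus with a bow tie", generator=generator).images[0]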


Because it seems to me like poor engineering to leak it. Not disputing it, just surprised to see it


Remember that just like with every other model they expose, DALL-E 3 API is stateless. GPT-4 to DALL-E 3 hand-off is implemented client side (similar to plugins), so the entire exchange goes through your machine. Everything - seed included - starts "leaked" by design.


Maybe I’m missing something. What is the threat of exposing the seed? It just makes images reproducible.


Am I the only one appalled by the audacity to call the writing of vague instructions for an AI "engineering"? However many levels of indirection there might be at play.


I see two definitions of engineering:

1. Building engines from scratch, i.e. something that converts a thing from X to Y consistently.

2. Knowing enough about an engine to get it to do what you want it to do, as well as maintaining it.

Prompt engineering initially started from the latter, back before ChatGPT and similar systems that you could just give instructions to.

Engineers have to be familiar enough to know how it works, what it's strong at, what it isn't, things like margin of failure, all that stuff. A lot of it is just prompt alchemy, I guess, but the lower part of this article has some juicy reverse engineering.


As opposed to writing vague instructions for a compiler?


How do you mean? Do you mean source code? What's vague about that?


Except when writing in assembly, the compiler can make unintuitive decisions about how to transform our source code into something that runs on a particular piece of hardware, and as modern hardware complexity increases it becomes unlikely that a human can guess the exact code that will run. An LLM is a higher level of abstraction than most existing programming languages, but the analogy of the GP is meaningful to me.


Good point. When I'm going back and forth writing code and checking its assembly, I also reason about the compiler almost like an intelligent being:

"oh, it doesn't see this condition, so if I help it a little bit, then... Ah now it unrolled, and inlined, but accidentally trashed instruction cache, how will I convince it not to..."


This made me look into the etymology of "engineer", and I was slightly surprised that it is all about weapons, and not so much bridges.


If social engineering counts, this should too.


Fair point. Maybe the term engineering has a tradition of being used in contexts that are not associated with hard science. Another example: the term "software engineering" was invented to point out the lack of scientific foundation for programming practices and to motivate the community to fix that.

Or maybe it's just me associating engineering with hard science.


Gave this a try on ChatGPT-4 Enterprise; it confirmed it could generate images using a text-to-image model similar to DALL-E, but when prompted to, all it says is...

"I'm unable to generate or provide the image at the moment. However, you can use the description you've provided with an AI image generation tool like DALL-E or a similar service. They can create detailed and imaginative visuals based on textual prompts like yours."

Guess I'll try again later.


You need to make sure you select the correct model to do it. Hover over GPT-4 and select DALL-E 3. Should work then!


Ah yeah, many thanks for the clarification. I hadn't noticed that UI drop-down. As of 27th October 2023, you are currently much more helpful than ChatGPT-4 ;)

When asked about this, it said it could do it. When prompted, it decided it couldn't. When asked for an explanation, it gave me: "I apologize for the confusion and any frustration it may have caused. As of my last training cut-off in April 2023, I am not equipped with the capability to generate images directly within this chat interface. My earlier response was incorrect, and I appreciate your understanding as I correct this mistake."


Thank you for this, I had trouble figuring out how to try out DALL-E from ChatGPT.


Love the idea of prompting for Sesame Street characters; someone tell him his F1 cars are on 3 wheels in the last image though...


Any ideas on if/when (image+text)->image models will be available?

Would be good to be able to iterate on images (keep this, change that etc).

The use of the seed looks useful but I'm guessing it has its own limitations.
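
Instruction-driven (image+text)->image editing already exists as an open model. A sketch with InstructPix2Pix via diffusers (input.png is a placeholder for your own image):

    import torch
    from diffusers import StableDiffusionInstructPix2PixPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        "timbrooks/instruct-pix2pix", torch_dtype=torch.float16).to("cuda")

    # Takes an existing image plus an edit instruction; image_guidance_scale
    # controls how closely the output sticks to the original.
    image = pipe("add a walrus",
                 image=load_image("input.png"),  # placeholder file
                 image_guidance_scale=1.5).images[0]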


Of course it has to produce "diversity". I wonder what the incentive is to do that.


Perhaps the incentive is to reflect the world we live in?


This implies the training dataset does not reflect the world.

Adding minority characters to every group of four people does not reflect even American reality, let alone other countries'.


Are DALL-E 3 and GPT-V now supported in the API?


There is no GPT-V, there is GPT-4V which adds the image recognition component to GPT-4. I believe the API for both this and DALL-E 3 is releasing sometime early November.


Where did you see "early November"? I just spent a lot of time looking for a date and could only find the official blog post being vague about it:

> DALL·E 3 is now in research preview, and will be available to ChatGPT Plus and Enterprise customers in October, via the API and in Labs later this fall.


(edit: wrong)


The "second one" he's referring to does not have those issues.

https://static.simonwillison.net/static/2023/dalle-3/add-wal...


Sure Simon. Let's see you try to get a minimalist image out of it.




