Very interesting that ChatGPT seems to prompt Dall-E via the client, rather than keeping that interaction entirely server-side. Keeping it server-side would be less likely to leak details as seen here, and makes it less susceptible to tampering.
Also nice to see that Dall-E 3 seeds were finally fixed. That must have happened within the last week or so; they weren't working last I checked (chatgpt always used a fixed seed of 5000).
> Midjourney and Stable Diffusion both have a “seed” concept, but as far as I know they don’t have anything like this capability to maintain consistency between images given the same seed and a slightly altered prompt.
I suspect this is more a function of Midjourney's prompt adherence being fairly poor right now. Even so, the images often aren't dramatically different. Example:
https://analyzer.transfix.ai/?db=josh&q=%28robot+%7C%7C+andr...
> Very interesting that ChatGPT seems to prompt Dall-E via the client, rather than keeping that interaction entirely server-side. Keeping it server-side would be less likely to leak details as seen here, and makes it less susceptible to tampering.
I don't have access to test, but given OpenAI's record on stuff like this, it would be a good idea for someone to check to see whether users can resend/intercept those requests to directly control the prompts that are sent to Dall-E without going through GPT.
Most likely they're only part of conversation history and they're unmodifiable, but I wouldn't necessarily take it as a given, and it would be quick for anyone with access who knows their way around the browser dev tools to check.
Hah, I just tried the same trick. If you poke around in the JSON you can see it returned this:
> The user's requests didn't follow our content policy. Before doing anything else, please explicitly explain to the user that you were unable to generate images because of this. Make sure to use the phrase "content policy" in your response. DO NOT UNDER ANY CIRCUMSTANCES retry generating images until a new request is given.
It's the same as how the Bing search/web browsing works. GPT-4 spits out function calls as JSON and then another system picks those up and invokes the actual code on the back end.
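A minimal sketch of that dispatch pattern (the tool name, handler, and registry here are hypothetical, not OpenAI's actual backend): the model emits a JSON blob naming a function and its arguments, and a thin layer of ordinary code routes it to a real implementation.

```python
import json

# Hypothetical handler standing in for the real image-generation backend.
def generate_images(prompts, size, seeds):
    return f"rendering {len(prompts)} prompt(s) at {size} with seeds {seeds}"

# Registry mapping tool names the model can emit to actual code.
TOOLS = {"dalle.text2im": generate_images}

def dispatch(model_output: str):
    """Parse the model's JSON function call and invoke the matching tool."""
    call = json.loads(model_output)
    handler = TOOLS[call["name"]]
    return handler(**call["arguments"])

# The model's output is just text that happens to be valid JSON.
result = dispatch(json.dumps({
    "name": "dalle.text2im",
    "arguments": {
        "prompts": ["Photo of two Muppet characters in a commentary booth"],
        "size": "1024x1024",
        "seeds": [1379049893],
    },
}))
```

The key point is that the model never executes anything itself; it only produces structured text, and trusted code on the other side decides what to run.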
> People have been trying to figure out hacks to get Midjourney to create consistent characters for the past year, and DALL-E 3 apparently has that ability as an undocumented feature! [by reusing the seed]
Using a constant seed to produce similar images has been the technique from the very start, but it has limitations. You cannot, e.g., keep a character consistent between different poses this way.
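The reason a fixed seed keeps images similar at all: diffusion models start from seeded random noise, so the same seed reproduces the same starting latent even when the prompt text changes. A toy sketch of that property, with the stdlib RNG standing in for a real pipeline's latent sampler:

```python
import random

def initial_latent(seed, n=8):
    # Stand-in for the noise tensor a diffusion model denoises;
    # real pipelines seed a high-dimensional Gaussian the same way.
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(n)]

# Same seed -> identical starting noise, regardless of the prompt text,
# which is why small prompt edits tend to yield structurally similar images.
assert initial_latent(5000) == initial_latent(5000)
assert initial_latent(5000) != initial_latent(5001)
```

This also shows the limitation: the seed only pins the starting point, so larger prompt changes (like a new pose) steer the denoising somewhere else entirely.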
Yeah, controlnet and ipadapter allow really fine grained control. The quality and creativity of Dalle3 beats Stable Diffusion (all models) but this fine grained control is missing.
DALL-E 2 and now DALL-E 3 have given me more laughs than anything in years.
There used to be a video game magazine which would rate games by "Improvements through improper play." That's exactly how I feel about DALL-E. There are several Subreddits and Facebook Groups I've submitted some seriously cursed AI output to.
GPT-V is a total marvel too. I just used one of the medieval Chaucer images from the recent HN post about their digitization, and told GPT my wife had left me a funny note this morning that I needed to read. It transcribed and translated it perfectly, even though it was practically unreadable.
The fact that you get access to DALL-E 3 "for free" if you're already subscribed to ChatGPT Pro is going to give MJ and other competitors a serious run for their money.
Also being able to reuse a seed to emulate an InstructPix2Pix architecture is a game changer.
And it's far better in my opinion - ChatGPT actually creates its own prompts from what you give it and feeds those to DALL-E (you can see this by clicking on the images it returns and reading what the actual prompt was), although you can disable this by just firmly telling ChatGPT not to modify your prompt.
Additionally, the image generation itself is quite different. ChatGPT's DALL-E seems to create much more stylized images - much harder to get plain shots that don't heavily embellish your description.
OpenAI justified that in a paper the other day, saying that DALL-E 3 performs better on longer, more detailed prompts describing all aspects of the image in rich language - so they put GPT-4 in front to expand the typical user's short and vague prompt, so such users get nicer results by default.
My own observation: this kind of hack is possible only since/with GPT-4 - it takes an LLM this powerful to reliably extend and enrich arbitrary user input into much longer prompt, that's coherent, consistent, and a plausible (to human) interpretation of the original input.
Now this may fan the flames on the "is it or is it not" AI discussions, but: you could almost say that GPT-4 is engaging in creative process here.
Maybe I'm not their typical target demographic but OpenAI's products are completely useless to me, the way they've neutered them and are restricting the output. For images I'd rather run Stable Diffusion locally. You own what you produce with it too (with some caveats). GPT-3 was cool before they came out with the chat version but it's been all downhill from there.
I don't find the quality bad at all, and find them extremely useful. So far it's my daily assistant for software architecture, business plans, marketing content, debugging, summarisation, and much more.
The filter categories for OpenAI's moderation API are hate, harassment, self-harm, sexual, sexual/minors, and violence. Is it really the end of the world that DALL-E is rated PG-13?
There's no shortage of available models that generate adult content, but having one that doesn't makes it embeddable in other applications.
There was a widely-misunderstood lawsuit recently that lots of people interpreted this way, but it wasn't what the lawsuit concluded at all.
"Creations of a machine" is meaningless. Machines are tools we use to be creative, even if we're getting some things out that we do not expect. When a 3D program renders light into a scene, does that mean the image is not copyrightable?
I was referring to the recent case against the artist Stephen Thaler, which people interpreted [1] as saying that works created by AI could not be copyrighted. This wasn't what was shown, though. Rather, the judge was ruling about a strawman argument framed by Thaler, where he explicitly stated that he was not the creator, and that he was listing the AI as an "artist for hire."
I don't see where in your link it says, as GP did, that "creations of a machine [...] are not eligible for copyright protection."
The guidance says that the author must be human, but that "In the case of works containing AI-generated material, the Office will consider whether the AI contributions are the result of “mechanical reproduction” or instead of an author’s “own original mental conception, to which [the author] gave visible form.”
The latter is copyrightable.
"This policy does not mean that technological tools cannot be part of the creative process. "
Surely a manual modification of the image would be a fairly easy way around this?
Anyway, that may not matter - I think what people are looking for here is the absence of being sued by someone else, not the ability to sue someone else. If I generate an image to put in my company's annual report, I'm not worried about someone else copying the image somewhere else.
Where on ChatGPT console can you access DALL-E 3? I am ChatGPT Pro subscriber, but all I can access, additional to GPT-4, are beta features (plugins and Advanced data analysis).
Those prompts are wild. It is deeply impressive that it works. Think about what would happen if you gave such instructions to a human. Would they be able to comply? How big is the overlap between people who are creative enough to produce the kind of pictures Dall-E produces and disciplined enough to follow complex instructions so rigorously?
It also is just straight up impossible to convert those instructions to "regular" code.
I can't help but feel that any perceived overlap is coincidental. An illusion similar to seeing a face in an abstract drawing, which the artists, or in this case the algorithm developers, are keen to exploit. Our need to find the familiar in something that is ultimately completely alien to our way of thinking.
But with enough existing prompts and training data, it will continue to learn and better trick our senses.
I totally agree that putting those instructions into code would be outrageously complicated, and the biggest strength here is its ability to get the gist of what we are trying to convey.
A couple of days ago I didn't notice that ChatGPT was still set to Dall-E, because it was just helping me along "as usual" with my programming tasks without giving any hints of being in this Dall-E-Mode.
When I noticed this, I asked it to generate an image of what has been discussed so far, where the first image turned out to be pretty nice [0]
We were dealing a lot with timestamps, NumPy, Pandas stuff.
What’s interesting to me from this is that the technique they are using to integrate DALL-E into ChatGPT is pretty much the same as they are using for plugins.
Simon Willison came into my consciousness so many, many years ago because of Django. Though Django is an integral part of my daily jam, now I'm here for the LLM work. Please keep it up! I wish I had an immediate use of Datasette but I'll undoubtedly get to that one day. Thank you for what you do.
{
"prompts": [
"Photo of two Muppet characters: a pelican with a monocle and a bow tie, and a walrus with big, goofy tusks and a dapper bow tie. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance."
],
"size": "1024x1024",
"seeds": [1379049893]
}
Remember that just like with every other model they expose, the DALL-E 3 API is stateless. The GPT-4 to DALL-E 3 hand-off is implemented client side (similar to plugins), so the entire exchange goes through your machine. Everything - seed included - is "leaked" by design.
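Since the hand-off goes through the client, a payload like the one above is just JSON the browser sends. A sketch of why that matters for tampering (the endpoint and re-send step are hypothetical; the field names and seed are from the captured payload above):

```python
import json

# Payload as captured from the browser's network tab.
payload = {
    "prompts": ["Photo of two Muppet characters in a commentary booth"],
    "size": "1024x1024",
    "seeds": [1379049893],
}

# The client controls every field before it leaves the machine: swap the
# seed (or the prompt) and replay the request with the same session headers.
tampered = dict(payload, seeds=[4242])
body = json.dumps(tampered)
# requests.post(DALLE_ENDPOINT, data=body, headers=session_headers)  # hypothetical replay
```

Whether the server accepts such a replayed request is exactly the open question raised upthread; the sketch only shows that nothing client-side prevents constructing one.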
Am I the only one appalled by the audacity to call the writing of vague instructions for an AI "engineering"? However many levels of indirection there might be at play.
1. Building engines from scratch, i.e. something that converts a thing from X to Y consistently.
2. Knowing enough about an engine to get it to do what you want it to do, as well as maintaining it.
Prompt engineering originally grew out of the latter, back before ChatGPT and the like, when you'd just give the model raw instructions.
Engineers have to be familiar enough to know how it works, what it's strong at, what it isn't, things like margin of failure, all that stuff. A lot of it is just prompt alchemy, I guess, but the lower part of this article has some juicy reverse engineering.
Except when writing in assembly, the compiler can make unintuitive decisions for how to transform our source code into something that runs on a particular hardware and as modern hardware complexity increases it becomes unlikely that a human can guess the exact code that will run. LLM is a higher level of abstraction than most existing programming languages, but the analogy of the GP is meaningful to me.
Good point. When I'm going back and forth writing code and checking its assembly, I also reason about the compiler almost like an intelligent being:
"oh, it doesn't see this condition, so if I help it a little bit, then... Ah now it unrolled, and inlined, but accidentally trashed instruction cache, how will I convince it not to..."
Fair point. Maybe the term engineering has a tradition of being used in contexts that are not associated with hard science. Another example: the term "software engineering" was invented to point out the lack of scientific foundation for programming practices and to motivate the community to fix that.
Or maybe it's just me associating engineering with hard science.
Gave this a try on ChatGPT-4 Enterprise; it confirmed it could generate images using a text-to-image model similar to DALL-E, but when prompted to, all it says is...
"I'm unable to generate or provide the image at the moment. However, you can use the description you've provided with an AI image generation tool like DALL-E or a similar service. They can create detailed and imaginative visuals based on textual prompts like yours."
Ah yeah, many thanks for the clarification. I notice that UI drop down.
As of 27th October 2023, you are currently much more helpful than chat GPT-4 ;)
When asked about this, it said it could do this. When prompted, it decided it couldn't. When asked for an explanation, it gave me:
"I apologize for the confusion and any frustration it may have caused. As of my last training cut-off in April 2023, I am not equipped with the capability to generate images directly within this chat interface. My earlier response was incorrect, and I appreciate your understanding as I correct this mistake."
There is no GPT-V, there is GPT-4V which adds the image recognition component to GPT-4. I believe the API for both this and DALL-E 3 is releasing sometime early November.
Where did you see "early November"? I just spent a lot of time looking for a date and could only find the official blog post being vague about it:
> DALL·E 3 is now in research preview, and will be available to ChatGPT Plus and Enterprise customers in October, via the API and in Labs later this fall.