I looked at this, thought about it, waited an hour, then looked at it again, and I still can't help but think this is useless.
We can already weight parts of prompts, and we can already specify colors or styles for parts of an image. And even if we could not, none of this needs rich text.
For starters, I think their comparisons are dishonest. They compare "plaintext" prompts with "rich text" prompts, but the rich text prompts contain more information. Seriously, who is surprised that the following two prompts give different images?
(1) "A girl with long hair sitting in a cafe, by a table with coffee on it, best quality, ultra detailed, dynamic pose."
(2) "A girl with long [Richtext:orange] hair sitting in a cafe, by a table with coffee on it, best quality, ultra detailed, dynamic pose. [Footnote:The ceramic coffee cup with intricate design, a dance of earthy browns and delicate gold accents. The dark, velvety latte is in it.]"
The worst part is "Font style indicates the styles of local regions". In the comparison with other methods they actually have to specify in parentheses what each font means style-wise, because nobody knows and (let's be frank) nobody wants to learn.
So why not just use these plaintext parentheses in the prompt?
I stopped myself from immediately posting my (rather negative) opinion, but after more than an hour it hasn't changed. As far as I can see, this isn't useful; rich text prompts are a gimmick.
Thanks a lot for the comment! (one of the authors here)
RE: plaintext
- The "plain-text" result is just a baseline. We call the "plaintext parentheses in the prompt" full-text (i.e., expanding the rich text info into a long sentence). We show many "full-text" in the paper https://arxiv.org/pdf/2304.06720.pdf.
You can see in Figure 11 that full-text results cannot change the color or style and do not respect the description. More examples are in Figures 13, 14, and 15.
The main issue with using full-text is that it cannot preserve the original plain-text image, thereby requiring many rounds of prompt tuning/engineering. We also compared with two other image editing methods, Prompt-to-Prompt and InstructPix2Pix, but they could not handle localized editing well. You can see example comparisons for color (Figure 4), style (Figure 5), footnote (Figure 8), and font size (Figure 9): https://arxiv.org/pdf/2304.06720.pdf
RE: Style
- Yes, you can specify what styles you want by just describing them.
I get now that with the side-by-side "plain text"/"rich text" comparisons you're trying to highlight how similar they are, differing only in the regions that are annotated in the rich-text version. But my first impression was that you're comparing against a weak baseline, which doesn't look so good.
The rich text presentation is merely cute. But the underlying feature is very nice. Being able to focus details on a specific aspect of an image without worrying about it leaking into other aspects would be greatly appreciated.
How about a plain-text interface like this?
> A girl with [long hair](orange) sitting in a cafe, by a table with [coffee](^1) on it, best quality, ultra detailed, dynamic pose. [^1](Ceramic coffee cup with intricate design, a dance of earthy browns and delicate gold accents. The dark, velvety latte is in it.)
It feels like that is where the real value is. Imagine describing all the assets of a game, story, or something larger than a single image as mainly "what" descriptions referring to broad styles of things, and then a second body of text spelling out those styles in detail.
It could be a text description of a fighter or noble wearing coats or armour, and then you substitute in different style descriptions of coats and armour depending on the family, class, race, or other attributes suitable for the world you're trying to generate.
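Just to make the idea concrete, here is a throwaway sketch of how that bracketed syntax could be parsed into span-level attributes. The syntax and field names are purely my own invention, not anything the paper defines:

```python
import re

# Hypothetical syntax from the proposal above:
#   [span](attribute)  annotates a span with a color/style/footnote reference,
#   [^1](text)         defines a footnote body.
SPAN_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def parse_prompt(text: str):
    """Split an annotated prompt into a plain prompt plus a list of span attributes."""
    footnotes, spans, plain_parts, cursor = {}, [], [], 0
    for m in SPAN_RE.finditer(text):
        label, attr = m.group(1), m.group(2)
        plain_parts.append(text[cursor:m.start()])
        if label.startswith("^"):            # footnote definition, not part of the prompt
            footnotes[label] = attr
        else:
            spans.append({"text": label, "attribute": attr})
            plain_parts.append(label)        # keep the span text in the plain prompt
        cursor = m.end()
    plain_parts.append(text[cursor:])
    plain = " ".join("".join(plain_parts).split())
    # Resolve footnote references like "^1" into their full descriptions.
    for span in spans:
        span["attribute"] = footnotes.get(span["attribute"], span["attribute"])
    return plain, spans

prompt = ("A girl with [long hair](orange) sitting in a cafe, by a table with "
          "[coffee](^1) on it. [^1](Ceramic coffee cup with intricate design.)")
plain, spans = parse_prompt(prompt)
print(plain)   # plain-text prompt with the annotations stripped
print(spans)   # [{'text': 'long hair', 'attribute': 'orange'}, {'text': 'coffee', 'attribute': 'Ceramic ...'}]
```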
Yes, you can expand the rich-text information into a long sentence. We call this full-text in the paper. The issue with using full-text is that it's hard to edit the image interactively: every time you change the text, you get an entirely different image.
With the same seed, and an extremely similar prompt, why would you get an entirely different image?
If I take seed 9999999 (just an example) and my prompts are
(1) "very large gothic church at dusk, spooky, horror, red roses" and
(2) "very large gothic church at dusk, spooky, horror, white roses"
then with all the models I've tested over the last year or so, you get _very_ similar images, with differently colored roses and (at most) very minor changes elsewhere. This only seems to work if you keep in mind that the prompt is parsed left to right, so changes closer to the beginning of the prompt have larger effects. Again, of course, you need the same seed.
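For concreteness, this is roughly the kind of same-seed comparison I mean, using the diffusers library. The checkpoint and settings are just whatever I had at hand, nothing from the paper:

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint; any Stable Diffusion model behaves the same way here.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "very large gothic church at dusk, spooky, horror, red roses",
    "very large gothic church at dusk, spooky, horror, white roses",
]

images = []
for prompt in prompts:
    # Re-seed before every call so both prompts start from the same initial noise.
    generator = torch.Generator(device="cuda").manual_seed(9999999)
    images.append(pipe(prompt, generator=generator, num_inference_steps=30).images[0])

for i, img in enumerate(images):
    img.save(f"church_{i}.png")
```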
But with that said, why would that be any different with plain/full/rich text? Apologies if I am somehow blinkered and asking something really obvious.
Yup, it could be similar, but that mostly only works for very simple prompts (e.g., one subject in the image).
For example, in Figure 11 of the paper (https://arxiv.org/pdf/2304.06720.pdf), you can see that full-text "rustic cabin -> rustic orange cabin" does not turn the cabin orange.
For coloring, the core benefit of our method is that it allows precise color control. For example, it can generate colors with rare names (e.g., Plum Purple or Dodger Blue) or even particular RGB triplets that we cannot describe well with text.
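As a rough sketch of what guiding toward an exact RGB triplet can mean (a generic region-color loss with placeholder tensors, heavily simplified rather than our exact implementation):

```python
import torch

def region_color_loss(image: torch.Tensor, mask: torch.Tensor, target_rgb) -> torch.Tensor:
    """MSE between the mean color of a masked region and a target RGB triplet.

    image: (3, H, W) tensor in [0, 1]; mask: (H, W) binary token map for the region.
    A loss like this can be differentiated w.r.t. the latents and used as guidance
    during sampling so the region is pushed toward the exact color.
    """
    target = torch.tensor(target_rgb, dtype=image.dtype, device=image.device) / 255.0
    weights = mask.float().flatten()                        # (H*W,)
    pixels = image.flatten(1)                               # (3, H*W)
    region_mean = (pixels * weights).sum(dim=1) / weights.sum().clamp(min=1.0)
    return torch.mean((region_mean - target) ** 2)

# Example: push a stand-in "hair" region toward Dodger Blue (30, 144, 255).
image = torch.rand(3, 64, 64, requires_grad=True)
mask = torch.zeros(64, 64)
mask[:20, :] = 1.0                                          # stand-in token map
loss = region_color_loss(image, mask, (30, 144, 255))
loss.backward()                                             # gradient available for guidance
```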
I had the same thought. The gothic church one, for example: why wouldn't I just write "A pink gothic church in the sunset" instead of writing "A gothic church" and then taking the extra steps to turn the word "church" pink?
Of course, I'm very ignorant of the uses of such tech, so there's probably some usefulness in this.
Because, at least with current models, the pinkness would spread to the rest of the image. You'd end up with not only a pink church but a pink sunset.
It's even worse with styles; Midjourney can't do a guitar in one style and the rest of the image in another. You really only get one style per image.
The value I see is in constructing more complex prompts. I agree with your example, but I could see myself using this feature for prompts with multiple objects/aspects that require specific details. It's probably not much different from inlining all the details, just a nice separation of concerns: you can describe the high-level requirement first, then add and tweak individual details.
Exactly, that's the feature that interested me the most. Ideally, the UI for footnotes would be even richer: e.g., selecting a word would open a small popup to provide more context.
I think the confusion is that you're reading this as if it were meant to be a presentation of next-gen text-to-image models. It's more like a fancy UI iteration, and I think it can find use cases in different tools.
Very reasonable critique, and valuable here despite being negative, because it was well considered. It changed my own perspective. Thank you for sharing, and I hope the authors respond.
All of the techniques they are showing have already existed for a while in places like Automatic1111/ComfyUI or their extensions (i.e., regional prompting, attention weights). Having it connect so seamlessly with rich text is awesome and a cool UI trick that might make normies notice it.
While I don't think the rich text thing is particularly useful, I'm very impressed by the approach, especially how it manages to change the resulting image in a way you can control (that is, without regenerating the whole thing and ending up with something with random undesirable changes).
The stability of the overall image during local changes makes me think this could be a key to video generation, because the biggest problem with existing diffusion-based approaches to video is their instability from frame to frame.
I would love to experiment with the idea of font interpretation. People can and do anthropomorphize fonts, and fonts also have names with meanings that might or might not be useful.
For example, I'm wondering whether a prompt written in Comic Sans should be turned into a comic-style illustration or come out as a simplistic, childish drawing. Is a gothic font meant to imply a style of architecture, old Germanic peoples, or goth music and style?
If I understand it correctly, this is an a priori mapping from font to style; the ML model doesn't interpret the font itself. I might be wrong though, the paper isn't super clear.
This is very cool, but it's gimmicky. All of the rich text could simply be a modifier before or after the word (such as an adjective or phrase). Given that most LLM work is plain text, this benefit isn't as neatly transferable as prompt engineering.
If you specify the color of an object in a Stable Diffusion prompt the odds of that color appearing somewhere else explode.
Tools like Regional Prompter help a lot with this, and I think the 'region specification by prompt' mode works similarly to Rich Text prompting.
Not even DALL-E 3 is free from color/style leaking into other parts of the image yet.
Thanks for the comment! Our method is model-agnostic. It can easily be adapted to any LLM (i.e., the text encoder) and any text-to-image model.
For example, the method was originally tested on Stable Diffusion 1.4, but we can easily apply it to Stable Diffusion XL (or any finetuned model like ANIMAGINE-XL) even though the new model has a different text encoder and U-Net weights.
I like this idea. It could be handy to be able to focus on individual descriptions in complex prompts. Is this then mostly a "UI" feature that is being translated to a traditional prompt?
(As a side note: using decorative typefaces was an unconvincing example.)
The UI part is basically a way to organize the user's intention. In the backend, we developed a method for extracting "token maps" (i.e., which spatial regions correspond to which words) and use region-based diffusion to achieve these localized editing results.
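Roughly, the idea of a token map looks like this (a simplified sketch with synthetic tensors, not our exact pipeline):

```python
import torch

def token_map(cross_attn: torch.Tensor, token_idx: int, threshold: float = 0.3) -> torch.Tensor:
    """Turn cross-attention scores into a binary spatial mask for one prompt token.

    cross_attn: (heads, H*W, num_tokens) attention probabilities from a diffusion
    U-Net cross-attention layer. Averaging over heads and normalizing per token
    gives a soft map; thresholding gives the region that "belongs" to the token.
    """
    attn = cross_attn.mean(dim=0)[:, token_idx]             # (H*W,) averaged over heads
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    side = int(attn.numel() ** 0.5)
    return (attn.reshape(side, side) > threshold).float()   # (H, W) binary token map

# Synthetic example: 8 heads, a 16x16 latent grid, 77 prompt tokens (CLIP length).
attn = torch.softmax(torch.randn(8, 16 * 16, 77), dim=-1)
mask = token_map(attn, token_idx=5)
print(mask.shape, mask.sum().item())    # which latent cells the 6th token maps to
```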