The rich text presentation is merely cute, but the underlying feature is very nice. Being able to focus detail on one specific aspect of an image without worrying about it leaking into other aspects would be greatly appreciated.
How about a plain-text interface like this?
> A girl with [long hair](orange) sitting in a cafe, by a table with [coffee](^1) on it, best quality, ultra detailed, dynamic pose. [^1](Ceramic coffee cup with intricate design, a dance of earthy browns and delicate gold accents. The dark, velvety latte is in it.)
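A rough sketch of how a syntax like that could be parsed, purely to make the idea concrete. Everything here is an assumption: the `[span](annotation)` and `[^n](definition)` forms come from the example above, while `parse_prompt` and its return shape are hypothetical and not part of the paper or any existing tool.

```python
import re

# Matches "[label](annotation)". Assumes neither part contains its own
# closing bracket, which holds for the example prompt above.
SPAN = re.compile(r"\[([^\]]+)\]\(([^)]*)\)")

def parse_prompt(text):
    """Split a bracket-annotated prompt into (plain_prompt, annotations)."""
    footnotes = {}

    # Pass 1: pull out footnote definitions such as "[^1](...)".
    def take_definition(match):
        label, body = match.group(1), match.group(2)
        if label.startswith("^"):
            footnotes[label] = body.strip()
            return ""              # drop the definition from the prompt
        return match.group(0)      # leave ordinary spans for pass 2

    without_defs = SPAN.sub(take_definition, text)

    # Pass 2: replace each annotated span with its plain text and record
    # the attached detail, resolving "^n" references where defined.
    annotations = {}

    def take_span(match):
        span, note = match.group(1), match.group(2).strip()
        annotations[span] = footnotes.get(note, note)
        return span

    plain = SPAN.sub(take_span, without_defs)
    return " ".join(plain.split()), annotations

prompt = (
    "A girl with [long hair](orange) sitting in a cafe, by a table with "
    "[coffee](^1) on it. [^1](Ceramic coffee cup with intricate design.)"
)
plain, annotations = parse_prompt(prompt)
print(plain)
print(annotations)
```

The plain prompt says what is in the image, and the dictionary carries the per-span detail that a region-aware generator could apply separately.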
It feels like that is where the real value is. Imagine describing all the assets of a game, story, or something larger than a single image mainly as "what" descriptions that refer to broad styles of things, with a second body of text spelling out those styles.
It could be a text description of a fighter or noble wearing coats or armour. You could then substitute in different style descriptions of the coats and armour depending on the family, class, race, or other attributes suitable for the world you're trying to generate.
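As a minimal sketch of that substitution idea, assuming the "what" description is a template with named style slots and each family has its own style sheet. The template text, the family names, and the `render` helper are all invented for illustration.

```python
from string import Template

# Hypothetical "what" description with named style slots.
CHARACTER = Template("A $role wearing a $coat and $armour, standing in a courtyard")

# Hypothetical style sheets, one per family.
STYLES = {
    "northern": {
        "coat": "heavy grey wool coat with fur trim",
        "armour": "dull iron plate",
    },
    "southern": {
        "coat": "light crimson silk coat",
        "armour": "gilded scale armour",
    },
}

def render(role, family):
    """Fill the style slots of the shared description for one family."""
    return CHARACTER.substitute(role=role, **STYLES[family])

for family in STYLES:
    print(render("noble", family))
```

The shared description stays fixed while only the style sheet changes, so the same "what" text can be reused across every family or class in the world.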
Yes, you can expand the rich text information into a long sentence. We call this "full-text" in the paper. The issue with using full-text is that it's hard to edit the image interactively: every time you change the text, you get an entirely different image.
With the same seed, and an extremely similar prompt, why would you get an entirely different image?
If I take seed 9999999 (just an example) and my prompts are
(1) "very large gothic church at dusk, spooky, horror, red roses" and
(2) "very large gothic church at dusk, spooky, horror, white roses"
then with all the models I have tested over the last year or so, you get _very_ similar images, with different colored roses and (at most) very minor changes elsewhere. This only seems to work if you keep in mind that the prompt is parsed left to right, so changes closer to the beginning of the prompt have larger effects. Again, of course, you need the same seed.
That said, why would that be any different with plain, full, or rich text? Apologies if I am somehow blinkered and asking something really obvious.
Yup, it could be similar, but it mostly only works for very simple prompts (e.g., one subject in the image).
For example, in Figure 11 of the paper (https://arxiv.org/pdf/2304.06720.pdf), you can see that full-text "rustic cabin -> rustic orange cabin" does not turn the cabin orange.
For coloring, the core benefit of our method is that it allows precise color control. For example, it can generate colors with rare names (e.g., Plum Purple or Dodger Blue) or even particular RGB triplets that we cannot describe well in text.