I like this idea. It could be handy to be able to focus on individual descriptions in complex prompts. Is this then mostly a "UI" feature that is being translated to a traditional prompt?
(As a side note: using decorative typefaces was an unconvincing example.)
The UI part is basically a way to organize the user's intention. On the backend, we developed a method for extracting "token maps" (i.e., which spatial regions correspond to specific words) and use region-based diffusion to achieve these localized editing results.
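To make the idea a bit more concrete, here is a rough toy sketch in NumPy of what "token maps plus region-based blending" could look like. This is not our actual implementation: the function names `token_masks` and `blend_regions`, the fixed threshold, and the hard (non-soft) masks are all assumptions made for illustration, and the "attention maps" and "predictions" are just hand-made arrays rather than anything produced by a diffusion model.

```python
import numpy as np

def token_masks(attn, threshold=0.5):
    """Turn per-token attention maps into boolean region masks.

    attn: array of shape (num_tokens, H, W) with non-negative scores.
    The threshold value is an illustrative assumption.
    """
    # Normalize each token's map by its own peak so one threshold fits all tokens.
    peak = attn.max(axis=(1, 2), keepdims=True)
    normalized = attn / np.maximum(peak, 1e-8)
    return normalized >= threshold

def blend_regions(base, regional, masks):
    """Composite per-token predictions onto a base prediction.

    base: (H, W, C), regional: (num_tokens, H, W, C), masks: (num_tokens, H, W).
    Later tokens overwrite earlier ones where masks overlap.
    """
    out = base.copy()
    for pred, mask in zip(regional, masks):
        out[mask] = pred[mask]
    return out

# Toy data: token 0 "attends" to the top half, token 1 to the bottom half.
attn = np.zeros((2, 4, 4))
attn[0, :2, :] = 1.0
attn[1, 2:, :] = 1.0
masks = token_masks(attn)

base = np.zeros((4, 4, 3))
regional = np.stack([np.full((4, 4, 3), 1.0), np.full((4, 4, 3), 2.0)])
result = blend_regions(base, regional, masks)
```

In this toy setup each word ends up controlling only the pixels its mask covers, which is the intuition behind the localized edits; a real pipeline would apply something like this at each denoising step rather than once on a finished image.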