The biggest difference is that existing multimodal models (e.g. GPT-4V and MM1) trained the text model first and then added the image component after text training was done ('late fusion'). MM1 learns a projection into the text embedding space, not discrete tokens, and thus cannot generate images.
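For concreteness, here is a minimal late-fusion sketch, not MM1's actual code: the module names, dimensions, and the embed_tokens / inputs_embeds interface are assumptions, roughly in the style of a HuggingFace decoder-only LM.

    import torch
    import torch.nn as nn

    class LateFusionLM(nn.Module):
        def __init__(self, text_lm, vision_encoder, vision_dim=1024, text_dim=4096):
            super().__init__()
            self.text_lm = text_lm                # pretrained decoder-only LM
            self.vision_encoder = vision_encoder  # pretrained, often frozen
            self.projector = nn.Linear(vision_dim, text_dim)  # learned bridge into text space

        def forward(self, image, input_ids):
            patch_feats = self.vision_encoder(image)            # (B, patches, vision_dim)
            image_embeds = self.projector(patch_feats)          # (B, patches, text_dim)
            text_embeds = self.text_lm.embed_tokens(input_ids)  # (B, seq, text_dim)
            # Image features enter as continuous embeddings, not vocabulary ids,
            # so the LM head has no image tokens to sample -> no image output.
            return self.text_lm(inputs_embeds=torch.cat([image_embeds, text_embeds], dim=1))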
Other work lets the model learn the image 'tokenization' more explicitly during training. That's closer to Adept's Fuyu architecture, which I am personally a fan of, but it also does not enable image generation.
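As I understand the Fuyu-style input path, it skips a separate vision encoder entirely and projects raw image patches straight into the decoder. A hedged sketch only; the patch size and names are illustrative, not Adept's code.

    import torch.nn as nn

    class PatchesAsTokens(nn.Module):
        """Raw pixel patches -> one learned linear layer -> decoder embeddings."""
        def __init__(self, patch_size=30, channels=3, text_dim=4096):
            super().__init__()
            self.patch_proj = nn.Linear(patch_size * patch_size * channels, text_dim)

        def forward(self, patches):
            # patches: (B, num_patches, patch_size*patch_size*channels), flattened pixels.
            # The image 'tokenization' is just this projection, trained along with the LM.
            return self.patch_proj(patches)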
You can generate images using late fusion as well, though I am not aware of other public work that discloses both early fusion and image generation.
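The recipe that does enable generation is to quantize images into discrete codes that share a vocabulary with text, so the same autoregressive head can emit image tokens and a decoder can turn them back into pixels. A hand-wavy sketch; vq_tokenizer, image_token_offset, and the encode/decode calls are hypothetical names, not any specific codebase.

    import torch

    def build_sequence(text_ids, image, vq_tokenizer, image_token_offset):
        # Quantize the image into discrete codebook ids, shifted so they occupy
        # their own slice of the shared text+image vocabulary (hypothetical API).
        image_ids = vq_tokenizer.encode(image) + image_token_offset
        return torch.cat([text_ids, image_ids], dim=-1)  # one interleaved token stream

    # Generation is then plain autoregressive sampling over the joint vocabulary;
    # sampled ids in the image range get passed to vq_tokenizer.decode(...) for pixels.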