This arch seems flexible enough to extend to video easily. Hopefully what we have here will be another "foundation" block like the transformer blocks in LLaMA.
Why:
It looks generic enough to incorporate text encoding / timestep conditioning into the block in all the imaginable ways (rather than in the limited ways of SDXL / SD v1, or Stable Cascade). I don't think there is much left to be done there other than to play with positional encoding (2D RoPE? — see the sketch below).
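To make the 2D RoPE idea concrete, here's a minimal sketch (my own illustration, not any paper's implementation; `rope_1d` and `rope_2d` are hypothetical names): split each head's dimension in half, rotate one half by the token's row index and the other by its column index, so attention scores become relative in both spatial axes.

```python
# Illustrative 2D RoPE for image-patch tokens -- a sketch, not a reference impl.
import torch

def rope_1d(x, pos, theta=10000.0):
    # x: (..., n, d) with d even; pos: (n,) integer positions along one axis
    d = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[:, None].float() * freqs[None, :]       # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                  # rotate dims in pairs
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

def rope_2d(x, h, w):
    # x: (batch, heads, h*w, d); half the dims encode rows, half encode columns
    rows = torch.arange(h).repeat_interleave(w)          # row index per token
    cols = torch.arange(w).repeat(h)                     # column index per token
    d_half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :d_half], rows),
                      rope_1d(x[..., d_half:], cols)], dim=-1)

q = torch.randn(1, 8, 16 * 16, 64)                       # tokens from a 16x16 grid
q_rot = rope_2d(q, 16, 16)                               # apply to q (and k) pre-attention
```

The nice property (and why it extends to video) is that you'd just add a third rotated chunk for the frame index — no learned positional table to resize.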
Great job! Now let's just scale up the transformers and focus on quantization / optimizations to run this stack properly everywhere :)