
This architecture seems flexible enough to extend to video easily. Hopefully what we have here will be another "foundation" block, like the transformer blocks in LLaMA.

Why:

It looks generic enough to incorporate text encoding / timestep conditioning into the block in all the imaginable ways (rather than in the limited ways of SDXL / SD v1 or Stable Cascade). I don't think there is much left to be done there other than to play with positional encoding (2D RoPE?).
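For context, the 2D RoPE idea mentioned above is usually built axially: split the head dimension in half, apply standard 1D rotary embedding over the row index to one half and over the column index to the other. Here is a minimal NumPy sketch of that construction; the function names and the half/half channel split are my assumptions, not anything from the paper:

```python
import numpy as np

def rope_1d(pos, dim, base=10000.0):
    """Standard 1D rotary angles for integer positions `pos` and even feature dim `dim`."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = np.outer(pos, inv_freq)                         # (len(pos), dim/2)
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin):
    """Rotate consecutive feature pairs (x0,x1),(x2,x3),... by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """Axial 2D RoPE sketch (an assumed construction, not the paper's):
    the first half of the channels encodes the token's row index,
    the second half encodes its column index."""
    half = x.shape[-1] // 2
    cos_r, sin_r = rope_1d(rows, half)
    cos_c, sin_c = rope_1d(cols, half)
    out = x.copy()
    out[..., :half] = apply_rope(x[..., :half], cos_r, sin_r)
    out[..., half:] = apply_rope(x[..., half:], cos_c, sin_c)
    return out
```

Because each pair of channels is only rotated, token norms are preserved, and query/key dot products end up depending on the relative (row, column) offset between tokens, which is what makes it attractive for image grids (and, with a third axis, video).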

Great job! Now let's just scale up the transformer and focus on quantization / optimization to run this stack properly everywhere :)




The paper has preliminary results for video as well.



