Fine Tuning a Diffusion Transformer (DiT) from a Single YouTube Video (oxen.ai)
4 points by gregschoeninger 9 months ago | 2 comments



Hey all,

We were messing around with PixArt as a way to fine-tune DiTs for image generation. I was pretty impressed with the results and thought I'd share.

https://www.oxen.ai/ox/PixArtTutorial

In this example I downloaded a video from YouTube (the trailer of Wes Anderson's Asteroid City), chopped it up into frames, captioned the frames with LLaVA, and then trained the model to generate in the style of the video. It's only about 340 frames of data, so the dataset is quick to generate and the model is quick to train.
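In case it's useful, the preprocessing step looks roughly like this (a simplified sketch, not the exact notebook code; the video URL, the 1 fps sampling rate, and the caption prompt are all placeholders):

    # Sketch: grab the trailer, dump frames with ffmpeg, caption with LLaVA.
    # The URL, sampling rate, and prompt below are illustrative placeholders.
    import subprocess
    from pathlib import Path
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    VIDEO_URL = "https://www.youtube.com/watch?v=<trailer_id>"  # placeholder
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", "trailer.mp4", VIDEO_URL], check=True)
    Path("frames").mkdir(exist_ok=True)
    subprocess.run(["ffmpeg", "-i", "trailer.mp4", "-vf", "fps=1", "frames/%04d.png"], check=True)

    model_id = "llava-hf/llava-1.5-7b-hf"
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

    for frame in sorted(Path("frames").glob("*.png")):
        prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"
        inputs = processor(images=Image.open(frame), text=prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=64)
        print(frame.name, processor.decode(out[0], skip_special_tokens=True))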

I also compare against pure prompting, since the style was not encoded in the model's base parameters.

Using PEFT and LoRA, training took less than 3 hours on an A10 GPU on Lambda Labs, so the whole run cost about $3. Pretty wild that it worked right out of the gate for that cheap.
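For reference, attaching the adapters with PEFT looks roughly like this (a minimal sketch; the rank and target modules here are illustrative choices, and the full training loop is in the notebook linked above):

    # Sketch: wrap PixArt's DiT transformer with LoRA adapters via PEFT.
    # Rank, alpha, and target module names are illustrative, not the exact config.
    import torch
    from diffusers import PixArtAlphaPipeline
    from peft import LoraConfig, get_peft_model

    pipe = PixArtAlphaPipeline.from_pretrained(
        "PixArt-alpha/PixArt-XL-2-512x512", torch_dtype=torch.float16
    )

    lora = LoraConfig(
        r=16,
        lora_alpha=16,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
    )
    pipe.transformer = get_peft_model(pipe.transformer, lora)
    pipe.transformer.print_trainable_parameters()  # only the LoRA weights train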

Hopefully it inspires others for what they could build!


So how many different prompts did you try it on?



