Fine Tuning a Diffusion Transformer (DiT) from a Single YouTube Video (oxen.ai)
4 points by gregschoeninger 9 months ago | 2 comments



Hey all,

We were messing around with PixArt as a way to fine-tune DiTs for image generation. I was pretty impressed with the results and thought I'd share.

https://www.oxen.ai/ox/PixArtTutorial

In this example I downloaded a video from YouTube (the trailer of Wes Anderson's Asteroid City), chopped it up into frames, captioned the frames with LLaVA, and then trained the model to generate in the style of the video. It's only about 340 frames of data, so the dataset is quick to generate and the model is quick to train.
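In case it's useful, the preprocessing step looks roughly like this (a simplified sketch, not the exact notebook code; the video URL, the 1 fps sampling rate, and the caption prompt are all placeholders):

    # Sketch: grab the trailer, dump frames with ffmpeg, caption with LLaVA.
    # The URL, sampling rate, and prompt below are illustrative placeholders.
    import subprocess
    from pathlib import Path
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    VIDEO_URL = "https://www.youtube.com/watch?v=<trailer_id>"  # placeholder
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", "trailer.mp4", VIDEO_URL], check=True)
    Path("frames").mkdir(exist_ok=True)
    subprocess.run(["ffmpeg", "-i", "trailer.mp4", "-vf", "fps=1", "frames/%04d.png"], check=True)

    model_id = "llava-hf/llava-1.5-7b-hf"
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

    for frame in sorted(Path("frames").glob("*.png")):
        prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"
        inputs = processor(images=Image.open(frame), text=prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=64)
        print(frame.name, processor.decode(out[0], skip_special_tokens=True))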

I also compare against pure prompting, since the style was not encoded in the model's base parameters.

Using PEFT and LoRA, training took less than 3 hours on an A10 GPU on Lambda Labs, so the whole run cost about $3. Pretty wild that it worked right out of the gate for that cheap.
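For reference, attaching the adapters with PEFT looks roughly like this (a minimal sketch; the rank and target modules here are illustrative choices, and the full training loop is in the notebook linked above):

    # Sketch: wrap PixArt's DiT transformer with LoRA adapters via PEFT.
    # Rank, alpha, and target module names are illustrative, not the exact config.
    import torch
    from diffusers import PixArtAlphaPipeline
    from peft import LoraConfig, get_peft_model

    pipe = PixArtAlphaPipeline.from_pretrained(
        "PixArt-alpha/PixArt-XL-2-512x512", torch_dtype=torch.float16
    )

    lora = LoraConfig(
        r=16,
        lora_alpha=16,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
    )
    pipe.transformer = get_peft_model(pipe.transformer, lora)
    pipe.transformer.print_trainable_parameters()  # only the LoRA weights train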

Hopefully it inspires others for what they could build!


So how many different prompts did you try it on?



