The rate of progress in ML this past year has been breathtaking.
I can’t wait to see what people do with this once ControlNet is properly adapted to video. Generating videos from scratch is cool, but the real utility of this will be the temporal consistency. Getting stable video out of Stable Diffusion typically involves lots of manual post-processing to remove flicker.
I think these are the main drivers behind the progress:
- Unsupervised learning techniques, e.g. transformers and diffusion models. You need unsupervised techniques in order to utilize enough data. There have been other unsupervised techniques in the past, e.g. GANs, but they don't work as well.
- Massive amounts of training data.
- The belief that training these models will produce something valuable. It costs anywhere from hundreds of thousands to millions of dollars to train these models. The people doing the training need to believe they're going to get something interesting out at the end. More and more people and teams are starting to see training a large model as something worth pursuing.
- Better GPUs, which enable training larger models.
- Honestly, the fall of crypto probably also contributed, because miners were eating a lot of GPU time.
I don't think transformers or diffusion models are inherently "unsupervised", especially not the way they're used in Stable Diffusion and related models (which are very much trained in a supervised fashion). I agree with the rest of your points though.
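For reference in this sub-thread, here is the noise-prediction objective commonly used to train text-conditioned diffusion models, written in standard DDPM-style notation. The symbols are assumptions taken from the usual textbook presentation, not from anything posted here: $x_0$ is a training image (or its latent encoding in latent diffusion), $c$ is the paired caption, $\epsilon$ is Gaussian noise, $t$ is the timestep, and $\bar\alpha_t$ is the cumulative noise schedule.

```latex
L(\theta) = \mathbb{E}_{x_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}
  \left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2 \right],
\qquad
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon
```

The regression target $\epsilon$ is sampled during training rather than annotated by a human, while the caption $c$ is paired data. Which of those two facts you emphasize is roughly where the "supervised vs. unsupervised" disagreement comes from.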
I disagree. Diffusion models are trained to model the probability distribution of their training dataset, like other generative models (GANs, VAEs, etc.). The fact that the architecture is a Transformer (or a CNN with attention like in Stable Diffusion) is orthogonal to the generative vs discriminative divide.
Unsupervised is a confusing term, as there is always an underlying loss being optimized and working as a supervision signal, even for good old k-means. But generative models are generally considered to be part of unsupervised methods.
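To make the k-means remark concrete, here's a minimal sketch in plain Python: no labels anywhere, yet each iteration of Lloyd's algorithm pushes down an explicit objective (the within-cluster sum of squared distances). The function name `kmeans_1d` and the toy data are made up for illustration.

```python
import random

def kmeans_1d(points, k, iters=10, seed=0):
    """Lloyd's algorithm on 1-D data; returns (centers, objective per iteration)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data itself
    history = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        # Objective: within-cluster sum of squared distances.
        loss = sum(min((p - c) ** 2 for c in centers) for p in points)
        history.append(loss)
    return centers, history

# Toy data (hypothetical): three loose groups, no labels attached.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1, 8.8]
centers, history = kmeans_1d(points, k=3)
print(history)  # the objective never increases from one iteration to the next
```

The loss is computed purely from the data and the current centers, which is the sense in which there is a "supervision signal" without any labels.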
> The belief that training these models will produce something valuable
Exactly. The growth in the next decade is going to be unimaginable, because governments and MNCs now believe that progress can realistically be made in this field.
One factor is that Stable Diffusion and ChatGPT were released a little over three months apart: August 22, 2022 and November 30, 2022, respectively. That brought a lot of attention and excitement to the field. More excitement, more people, more work being done, more progress.
Of course those two releases didn't fall out of the sky.
Attention, transformers, diffusion. Prior image synthesis techniques - i.e. GANs - had problems that made it difficult to scale them up, whereas the current techniques seem to have no limit other than the amount of RAM in your GPU.
> But what technically allowed for so much progress?
The availability of GPU compute time. Up until the Russian invasion of Ukraine, interest rates were low AF, so everyone and their dog thought it would be a cool idea to mine one or another sort of shitcoin. Once rising interest rates killed that business model for good, miners dumped their GPUs on the open market, and an awful lot of cloud computing capacity suddenly freed up.
Public availability of large transformer-based foundation models trained at great expense, which is what OP is referring to, is definitely unprecedented.
People figuring out how to train and scale newer architectures (like transformers) effectively, to be wildly larger than ever before.
Take AlexNet - the major "oh shit" moment in image classification.
It had an absolutely mind-blowing number of parameters at a whopping 62 million.
Holy shit, what a large network, right?
Absolutely unprecedented.
Now, for language models, anything under 1B parameters is a toy that barely works.
Stable Diffusion has around 1B or so - or the early models did, I'm sure they're larger now.
A whole lot of smart people had to do a bunch of cool stuff to be able to keep networks working at all at that size.
Many, many times over the years, people have tried to make larger networks, which fail to converge (read: learn to do something useful) in all sorts of crazy ways.
At this size, it's also expensive to train these things from scratch, and takes a shit-ton of data, so research/discovery of new things is slow and difficult.
But, we kind of climbed over a cliff, and now things are absolutely taking off in all the fields around this kind of stuff.
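For a sense of where the AlexNet number above comes from, here's a back-of-the-envelope parameter count. The layer shapes are an assumption: they're the commonly cited single-tower version of the architecture, ignoring the original two-GPU channel grouping (which brings the total down to roughly 61M). The helper names are made up for illustration.

```python
def conv_params(out_ch, in_ch, kernel):
    """Weights plus biases for a square-kernel convolution layer."""
    return out_ch * in_ch * kernel * kernel + out_ch

def fc_params(out_features, in_features):
    """Weights plus biases for a fully connected layer."""
    return out_features * in_features + out_features

# Assumed layer shapes (single tower, no channel grouping).
layers = {
    "conv1": conv_params(96, 3, 11),
    "conv2": conv_params(256, 96, 5),
    "conv3": conv_params(384, 256, 3),
    "conv4": conv_params(384, 384, 3),
    "conv5": conv_params(256, 384, 3),
    "fc6": fc_params(4096, 256 * 6 * 6),
    "fc7": fc_params(4096, 4096),
    "fc8": fc_params(1000, 4096),
}

total = sum(layers.values())
fc_share = (layers["fc6"] + layers["fc7"] + layers["fc8"]) / total
print(f"total parameters: {total:,}")        # 62,378,344
print(f"share in FC layers: {fc_share:.0%}")  # 94%
```

Under these assumptions the three fully connected layers hold the vast majority of the weights, and a 1B-parameter model is about 16 times the size of the whole thing.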
Take a look at XTTSv2 for example, a leading open source text-to-speech model. It uses multiple models in its architecture, but one of them is GPT.
There are a few key models that are still being used in a bunch of different modalities like CLIP, U-Net, GPT, etc. or similar variants. When they were released / made available, people jumped on them and started experimenting.
There has been massive progress in ML every year since 2013, partly due to better GPUs and lots of training data. Many are only taking notice now that it is in products, but it wasn't that long ago that there was skepticism on HN, even when software like Codex existed in 2021.
Where do you want to start? The Internet collecting and structuring the world's knowledge into a few key repositories? The focus on GPUs in gaming, and then the crypto market, creating a suite of libraries dedicated to hard scaling math? Or the miniaturization and focus on energy efficiency driven by phones, which made training at scale cost-effective? And finally, the papers released by Google and co., who didn't seem to recognise quite how easy it would be to replicate and build upon them. Nothing was unlocked, apart from a lot of people suddenly noticing how doable all this already was.
I mean, you probably didn't pay much attention to battery capacity before phones, laptops, and electric cars, right? Battery capacity has probably increased though at some rate before you paid attention. It's just when something actually becomes relevant that we notice.
Not that more advances don't happen with sustained hype, just that there's some sort of tipping point involving usefulness, based either on improvement of the thing in question or its utility elsewhere.
I have seen them, and the workflows to create those videos are extremely labor intensive. ControlNet lets you maintain poses between frames, but it doesn’t solve the temporal consistency of small details.
No, I think we’re actually close. My source is that I’m working on this problem, along with the incredible progress of our tiny three-person team at drip.art (http://api.drip.art) - we can generate a lot of frames that are consistent, and with interpolation between them, smoothly restyle even long videos. Cross-frame attention works for most cases, it just needs to be scaled up.
And that’s just for diffusion-focused approaches like ours. There are probably other techniques from the TokenFlow or NeRF families of approaches close to breakout levels of quality, with tons of talented researchers working on that too.
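For anyone unfamiliar with the term, here's a minimal sketch of the general idea behind cross-frame attention: every frame's tokens attend to the keys and values of one shared anchor frame, so details get pulled toward a common reference. This is not drip.art's implementation, just an illustration under heavy assumptions: plain Python lists, no learned Q/K/V projections, and made-up names (`attention`, `cross_frame_attention`, `frames`).

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output token is a weighted average of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def cross_frame_attention(frames, anchor=0):
    """Each frame's tokens query the keys/values of one shared anchor frame."""
    ref = frames[anchor]
    return [attention(frame, ref, ref) for frame in frames]

# Two toy "frames" of two 2-D tokens each (hypothetical data).
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.2, 0.8]],
]
out = cross_frame_attention(frames)
print(out[1])  # frame 1 re-expressed as mixtures of frame 0's tokens
```

In a real diffusion model this replaces or augments the self-attention inside the denoiser, with learned projections and many more tokens, which is where the cost of scaling it up comes from.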
The demo clips on the site are cool, but when you call it a "solved problem," I'd expect to see panning, rotating, and zooming within a cohesive scene with multiple subjects.
Thanks for checking it out! We’re certainly not done yet, but much of what you ask is possible or will be soon on the modeling side and we need tools to expose that to a sane workflow in traditional video editors.
Once a video can show a person twisting round, and their belt buckle is the same at the end as it was at the start of the turn, it's solved. VFX pipelines need consistency. Temporal consistency (TC) is a long, long way from being solved, except by hitching it to 3DMMs and SMPL models (and even then, the results are not fabulous yet).
> Haven't you seen the insane quality of videos on civitai?
I have not, so I went to https://civitai.com/ which I guess is what you're talking about? But I cannot find a single video there, just images and models.
Not sure I'd call that "insane quality", more like neat prototypes. I'm excited to see where things will be in the future, but clearly it has a long way to go.
A small percentage of the images are animations. This is (for obvious reasons) particularly common for images used on the catalog pages for animation-related tools and models, but it's also not uncommon for (mostly AnimateDiff-based) animations to be used to demo the output of other models.