Somewhat tangential, but I hadn't heard about the Emu model, which was apparently released (the paper [1] at least) in September. I was curious about the details and read the Emu paper and ... I feel like I'm taking crazy pills reading it.
> To the best of our knowledge, this is the first work highlighting fine-tuning for generically promoting aesthetic alignment for a wide range of visual domains.
... unlike Stable Diffusion which did aesthetic fine tuning when it was released? Or like the thousands of aesthetic finetunes released since?
> We show that the original 4-channel autoencoder design [27] is unable to reconstruct fine details. Increasing channel size leads to much better reconstructions.
Is it not expected that decreasing the compression ratio would lead to better reconstructions? The whole point of the latent diffusion architecture is to make a trade-off here. They're more than welcome to do pixel diffusion if they want better quality, or use an upscaling architecture.
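To make the trade-off concrete, here's a back-of-the-envelope calculation. It assumes SD-style 8x spatial downsampling of a 512x512 RGB image (the exact resolutions in the Emu paper may differ; this is just to show the arithmetic):

```python
# Rough latent compression ratios for an assumed SD-style autoencoder:
# 8x spatial downsampling of a 512x512 RGB image.
def compression_ratio(channels: int, downsample: int = 8,
                      image_size: int = 512, image_channels: int = 3) -> float:
    pixel_values = image_size * image_size * image_channels
    latent_side = image_size // downsample
    latent_values = latent_side * latent_side * channels
    return pixel_values / latent_values

# A 4-channel latent compresses the image 48x; a 16-channel latent
# only 12x, so better reconstruction of fine detail is unsurprising.
print(compression_ratio(4))   # 48.0
print(compression_ratio(16))  # 12.0
```

More channels simply means the diffusion model works against a less lossy bottleneck, which is exactly the trade-off latent diffusion was designed around.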
And then the rest of the paper is this long documentation that can be summed up as "we used industry standard filtering and then human filtering to build an aesthetic dataset which we finetuned a model with". Which, again, has been done a thousand times already.
I really, really don't mean to knock the researcher's work here. I'm just very confused as to why the work is being represented as new or groundbreaking. Contrast to OAI which documents using a diffusion based latent decoder. That's interesting, different, and worth publishing. Scaling up your latent space to get better results is just ... obvious? (As obvious as anything in ML is, anyway). Facebook's research isn't usually this off the mark. E.g. the Emu Edit paper is very interesting and contributes many new methods to the field.
Yeah. It is still useful for them to share these. My takeaways:
1. Data is all you need to generate these amazing videos with the right gait (gait is something I focused on).
2. Nobody is doing new network structures; it's AnimateDiff beefed up a little with temporal masking applied (a neat trick, not a big leap from the inpainting task we've already seen).
3. An additional conditioning vector helps, and can be trained: look at these editing tasks!
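The temporal-masking trick in point 2 is essentially inpainting along the time axis instead of the spatial ones. A minimal numpy sketch of the idea (shapes and function names are my own assumptions, not taken from the paper):

```python
import numpy as np

def apply_temporal_mask(video_latents: np.ndarray,
                        keep_frames: list) -> tuple:
    """Zero out all frames except the conditioning ones, returning the
    masked latents and the binary mask the model would be conditioned on.
    video_latents has shape (T, C, H, W): frames, channels, height, width."""
    num_frames = video_latents.shape[0]
    mask = np.zeros((num_frames, 1, 1, 1), dtype=video_latents.dtype)
    mask[keep_frames] = 1.0
    return video_latents * mask, mask

# Keep only frame 0 as conditioning; the model must "inpaint" the
# remaining frames in time, just like spatial inpainting fills pixels.
latents = np.random.randn(16, 4, 32, 32).astype(np.float32)
masked, mask = apply_temporal_mask(latents, keep_frames=[0])
```

The same mechanism generalizes to interpolation (keep first and last frames) or extension (keep a prefix of frames), which is why it's such a cheap conceptual step from inpainting.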
These are pretty valuable for an onlooker like me trying to decipher what Gen-2 or Pika Labs etc. are doing.
With the advent of these models my head canon now insists that when Star Trek characters say they "programmed" something, they really mean that they have a log of all of their iterative prompts, and that there's some optimization the computer can use to aggregate those into the final warp model/holodeck simulation/transporter filter/biobed pathogen detector/etc. without having to reiterate through all of those prompts again... kind of like a NixOS declarative build.
And when somebody comes along and fixes their program or reprograms what they did, they simply insert or change some of the prompts along the way and get a different effect.
When the characters add new data to the computer (like the episode where Geordi added the psycho profile of the enterprise engine designer), they're just tuning the foundational model with some new input data.
Yes absolutely. I’ve started thinking of some interfaces for this type of “programming”. I think we’ll have some pretty cool stuff to play around with in the not too distant future.
> There are 5047 classifications of tables on file. Specify design parameters.
Interestingly enough, it seems existing AI models are already better than the Star Trek computer at dealing with ambiguity. Stable Diffusion would just generate a "normal" table and let you go from there.
"The Commander is a physical representation of a dream; an idea, conceived of by the mind of a man. Its purpose: to serve human needs and interests. It's a collection of neural nets and heuristic algorithms; its responses dictated by an elaborate software written by a man, its hardware built by a man. And now -- and now a man will shut it off."
they don't believe it; they are placating an unease in society among people who know they have already been replaced. there is a group that plays devil's advocate with those people, and it is convenient to agree with the devil's advocates.
but there are lots of specialists I used to contract with in the ideation phase whom I no longer need:
professional logo designers
testing out names of potential services
designers for landing pages for websites
additional coders for landing pages of websites
templates for powerpoint presentations
graphics for them
many many billable hours for lawyers, for things I would otherwise have asked them about, and that's totally a risk I'm willing to shoulder. now I simply have them implement unless they are unable to corroborate the legal view. in the past I would have to explore several paths, then consider switching lawyers once I had all the information I wanted, having the subsequent lawyer implement without any knowledge of why.
some of these ideas generate revenue and I can get to that point far faster and cheaper
I can already code in the latest frameworks and have high proficiency in most media suites, but media creation was never where I specialized or wanted to spend my time
so there is a general denial that's kind of useful: if a big company wants tax breaks from a municipality, it can say "look, jobs, we're big on that"
Emu image generation is not significantly slower than SDXL or similar, so you would expect performance similar to Hotshot. The upscaler version (8 frames to 37 frames) would probably take significantly longer.
There's some source code in the paper for Emu Edit, at least. If you look at the supplementary material, you'll see they spell out the techniques used there too.
I didn't see a repository, but I think in this case the paper is actually a perfect balance of detail? I think Meta benefits from startups building using their tooling (startups usually buy ads), and so the lack of a full implementation leaves a bit of room for startups to turn the work into something a bit more production-ready.
The cool techniques from the paper are:
Generating a bunch of example images in one go, and using CLIP to score your generated images
And mixing pre-built pipelines and grammars to execute common tasks.
These two ideas alone (with examples) give people in the space plenty to run with.
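The generate-then-score idea boils down to ranking candidates by cosine similarity between a text embedding and each image embedding. A minimal sketch, using random vectors as stand-ins for real CLIP embeddings (running actual CLIP requires model weights; the 128-dim size and the batch of 8 are arbitrary assumptions):

```python
import numpy as np

def cosine_scores(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one text embedding and a batch of image
    embeddings -- the core of CLIP-based candidate filtering."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return image_embs @ text_emb

# Generate a batch of candidates, score them all, keep the best one.
rng = np.random.default_rng(0)
text = rng.normal(size=128)             # stand-in for a CLIP text embedding
candidates = rng.normal(size=(8, 128))  # stand-ins for 8 generated images
scores = cosine_scores(text, candidates)
best = int(np.argmax(scores))
```

In practice you'd swap the random vectors for the output of a real CLIP text/image encoder and keep the top-k candidates rather than a single winner.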
An impressive technical achievement, yes - but the presentation/marketing of this is absurd.
The generated videos are aesthetically horrendous. I don't know what kind of mental gymnastics are going on that they can confidently describe something where the body shapes are nonsensically in flux with every change of frame (look at the eagle's talons, or the dog's leg movements as it runs) as "high-quality video".
Is generative AI hype blinding them to how hideous these videos are, or do they know and they just pretend like it's something it isn't?
I don't like them; aesthetically they don't appeal, and technically they fall short as you describe. But just about a year ago this was the state of the art ('Age of Illusion' by Die Antwoord), with visual coherence maintainable for fewer than 10 frames.
[1] https://scontent-lax3-1.xx.fbcdn.net/v/t39.2365-6/10000000_1...