Somewhat tangential, but I hadn't heard about the Emu model, which was apparently released (the paper [1] at least) in September. I was curious about the details and read the Emu paper and ... I feel like I'm taking crazy pills reading it.
> To the best of our knowledge, this is the first work highlighting fine-tuning for generically promoting aesthetic alignment for a wide range of visual domains.
... unlike Stable Diffusion which did aesthetic fine tuning when it was released? Or like the thousands of aesthetic finetunes released since?
> We show that the original 4-channel autoencoder design [27] is unable to reconstruct fine details. Increasing channel size leads to much better reconstructions.
Is it not expected that decreasing the compression ratio would lead to better reconstructions? The whole point of the latent diffusion architecture is to make a trade-off here. They're more than welcome to do pixel diffusion if they want better quality, or use an upscaling architecture.
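To make the trade-off concrete, here's a back-of-the-envelope calculation. It assumes SD-style 8x spatial downsampling of a 512x512 RGB image (the exact resolutions in the Emu paper may differ; this is just to show the arithmetic):

```python
# Rough latent compression ratios for an assumed SD-style autoencoder:
# 8x spatial downsampling of a 512x512 RGB image.
def compression_ratio(channels: int, downsample: int = 8,
                      image_size: int = 512, image_channels: int = 3) -> float:
    pixel_values = image_size * image_size * image_channels
    latent_side = image_size // downsample
    latent_values = latent_side * latent_side * channels
    return pixel_values / latent_values

# A 4-channel latent compresses the image 48x; a 16-channel latent
# only 12x, so better reconstruction of fine detail is unsurprising.
print(compression_ratio(4))   # 48.0
print(compression_ratio(16))  # 12.0
```

More channels simply means the diffusion model works against a less lossy bottleneck, which is exactly the trade-off latent diffusion was designed around.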
And then the rest of the paper is this long documentation that can be summed up as "we used industry standard filtering and then human filtering to build an aesthetic dataset which we finetuned a model with". Which, again, has been done a thousand times already.
I really, really don't mean to knock the researcher's work here. I'm just very confused as to why the work is being represented as new or groundbreaking. Contrast to OAI which documents using a diffusion based latent decoder. That's interesting, different, and worth publishing. Scaling up your latent space to get better results is just ... obvious? (As obvious as anything in ML is, anyway). Facebook's research isn't usually this off the mark. E.g. the Emu Edit paper is very interesting and contributes many new methods to the field.
Yeah. It is still useful for them to share these. My takeaways:
1. Data is all you need to generate these amazing videos with the right gait (gait is something I focused on).
2. Nobody is doing new network structures; it's AnimateDiff beefed up a little with temporal masking applied (a neat trick, not a big leap from the inpainting task we've already seen).
3. An additional conditioning vector helps, and can be trained: look at these editing tasks!
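The temporal-masking trick in point 2 is essentially inpainting along the time axis instead of the spatial ones. A minimal numpy sketch of the idea (shapes and function names are my own assumptions, not taken from the paper):

```python
import numpy as np

def apply_temporal_mask(video_latents: np.ndarray,
                        keep_frames: list) -> tuple:
    """Zero out all frames except the conditioning ones, returning the
    masked latents and the binary mask the model would be conditioned on.
    video_latents has shape (T, C, H, W): frames, channels, height, width."""
    num_frames = video_latents.shape[0]
    mask = np.zeros((num_frames, 1, 1, 1), dtype=video_latents.dtype)
    mask[keep_frames] = 1.0
    return video_latents * mask, mask

# Keep only frame 0 as conditioning; the model must "inpaint" the
# remaining frames in time, just like spatial inpainting fills pixels.
latents = np.random.randn(16, 4, 32, 32).astype(np.float32)
masked, mask = apply_temporal_mask(latents, keep_frames=[0])
```

The same mechanism generalizes to interpolation (keep first and last frames) or extension (keep a prefix of frames), which is why it's such a cheap conceptual step from inpainting.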
These are pretty valuable for an onlooker like me trying to decipher what Gen-2 or Pika Labs etc. are doing.
With the advent of these models my head canon now insists that when Star Trek characters say they "programmed" something, they really mean that they have a log of all of their iterative prompts, and that there's some optimization the computer can use to aggregate those into the final warp model/holodeck simulation/transporter filter/biobed pathogen detector/etc. without having to reiterate through all of those prompts again... kind of like a NixOS declarative build.
And when somebody comes along and fixes their program or reprograms what they did, they simply insert or change some of the prompts along the way and get a different effect.
When the characters add new data to the computer (like the episode where Geordi added the psycho profile of the enterprise engine designer), they're just tuning the foundational model with some new input data.
Yes absolutely. I’ve started thinking of some interfaces for this type of “programming”. I think we’ll have some pretty cool stuff to play around with in the not too distant future.
> There are 5047 classifications of tables on file. Specify design parameters.
Interestingly enough, it seems existing AI models are already better than the Star Trek computer at dealing with ambiguity. Stable Diffusion would just generate a "normal" table and let you go from there.
"The Commander is a physical representation of a dream; an idea, conceived of by the mind of a man. Its purpose: to serve human needs and interests. It's a collection of neural nets and heuristic algorithms; its responses dictated by an elaborate software written by a man, its hardware built by a man. And now -- and now a man will shut it off."
they don't believe it; they are placating an unease in society among people who know they have already been replaced. there is a group that plays devil's advocate with those people, and it is convenient to agree with the devil's advocates.
but there are lots of specialists I used to contract with in the ideation phase whom I no longer need:
professional logo designers
testing out names of potential services
designers for landing pages for websites
additional coders for landing pages of websites
templates for powerpoint presentations
graphics for them
many many billable hours for lawyers, for things I would otherwise have asked them about, and that's totally a risk I'm willing to shoulder. now I simply have them implement unless they are unable to corroborate the legal view. in the past I would have to explore several paths, then consider switching lawyers once I had all the information I wanted, having the subsequent lawyer implement without any knowledge of why.
some of these ideas generate revenue and I can get to that point far faster and cheaper
I can already code in the latest frameworks and have high proficiency in most media suites, but media creation was never where I specialized or wanted to spend my time
so there is a general denial that's kind of useful: if a big company wants tax breaks from a municipality, it can say "look, jobs, we're big on that"
Emu image generation is not significantly slower than SDXL or similar, so you would expect performance similar to Hotshot. The upscaler version (8 frames to 37 frames) would probably take significantly longer.
There's some source code in the paper for Emu Edit, at least. If you look at the supplementary material, you'll see they spell out the techniques used there too.
I didn't see a repository, but I think in this case the paper is actually a perfect balance of detail? I think Meta benefits from startups building using their tooling (startups usually buy ads), and so the lack of a full implementation leaves a bit of room for startups to turn the work into something a bit more production-ready.
The cool techniques from the paper are:
Generating a bunch of example images in one go, and using CLIP to score your generated images
And mixing pre-built pipelines and grammars to execute common tasks.
These two ideas alone (with examples) give people in the space plenty to run with.
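The generate-then-score idea boils down to ranking candidates by cosine similarity between a text embedding and each image embedding. A minimal sketch, using random vectors as stand-ins for real CLIP embeddings (running actual CLIP requires model weights; the 128-dim size and the batch of 8 are arbitrary assumptions):

```python
import numpy as np

def cosine_scores(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one text embedding and a batch of image
    embeddings -- the core of CLIP-based candidate filtering."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return image_embs @ text_emb

# Generate a batch of candidates, score them all, keep the best one.
rng = np.random.default_rng(0)
text = rng.normal(size=128)             # stand-in for a CLIP text embedding
candidates = rng.normal(size=(8, 128))  # stand-ins for 8 generated images
scores = cosine_scores(text, candidates)
best = int(np.argmax(scores))
```

In practice you'd swap the random vectors for the output of a real CLIP text/image encoder and keep the top-k candidates rather than a single winner.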
An impressive technical achievement, yes - but the presentation/marketing of this is absurd.
The generated videos are aesthetically horrendous. I don't know what kind of mental gymnastics are going on that they can confidently describe something where the body shapes are nonsensically in flux with every change of frame (look at the eagle's talons, or the dog's leg movements as it runs) as "high-quality video".
Is generative AI hype blinding them to how hideous these videos are, or do they know and they just pretend like it's something it isn't?
I don't like them; aesthetically they don't appeal, and technically they fall short as you describe. But just about a year ago this was the state of the art ('Age of Illusion' by Die Antwoord), with visual coherence maintainable for fewer than 10 frames.
[1] https://scontent-lax3-1.xx.fbcdn.net/v/t39.2365-6/10000000_1...