AI Content Generation, Part 1: Machine Learning Basics (jonstokes.com)
133 points by jger15 on Sept 12, 2022 | 57 comments



“If you’ve been following the red-hot AI content generation scene, then you’re probably already aware that it’s likely a matter of months before authentic-looking video of newly invented Seinfeld bits like this can be generated by anyone with a few basic, not-especially-technical skills.”

A matter of months eh?

To me that claim looks to be about drumming up business. I realize lots of people and companies are making similar claims, and I have no doubt we will get there in a reasonable amount of time, but I believe that people with your level of expertise believe it will be at least a year until we achieve what you describe.

This whole thing is remarkable but now people are overselling it IMO.


Depending on how you're measuring it, video generation will come in less than 12 months, or it's already here.

- In the Video Diffusion paper[1] they generate 64x64, 16-frame videos. The paper's model is not released, but an open-source implementation using an Imagen-like pipeline exists[2]; no trained model is publicly available for it either.

- The CogVideo paper[3], which is essentially a huge transformer model, can generate 480x480 videos. The code and models are open source[4], but be warned that you need a huge GPU like an A100 to run the damned thing.

The future of text-to-video generation will probably be video diffusion, i.e. using 3D UNets, or more likely a much more optimized version of a UNet. Improvements on diffusion models are happening on a daily basis, and we're probably a couple of papers away from an efficiency breakthrough that lets us generate good-looking videos on a high-end GPU.

[1] https://arxiv.org/abs/2204.03458 [2] https://github.com/lucidrains/imagen-pytorch/tree/main/image... [3] https://arxiv.org/abs/2205.15868 [4] https://github.com/THUDM/CogVideo
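For intuition, here's a minimal sketch (my own, not from either paper) of the building block a "3D UNet" swaps in: a single Conv3d that mixes information across frames as well as pixels. The shapes follow the 64x64, 16-frame setting mentioned above.

    import torch
    import torch.nn as nn

    # A video batch is a 5D tensor: (batch, channels, frames, height, width).
    video = torch.randn(1, 3, 16, 64, 64)  # 16 frames of 64x64 RGB

    # The core change in a "3D UNet" is replacing 2D convolutions with 3D ones,
    # so each layer can look at neighbouring frames as well as neighbouring pixels.
    spatiotemporal_conv = nn.Conv3d(
        in_channels=3,
        out_channels=64,
        kernel_size=(3, 3, 3),  # (frames, height, width)
        padding=1,
    )

    features = spatiotemporal_conv(video)
    print(features.shape)  # torch.Size([1, 64, 16, 64, 64])

In practice the papers factorize this over space and time for efficiency, which is roughly what "a much more optimized version of a UNet" points at.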


It's very likely that realistic looking (to the level of Stable Diffusion) video will happen and tools to create it will be available within 12 months (maybe 80% likelihood).

What is likely to be missing is the ability to control that video in useful ways directly from prompts. There will be some type of direction, but as anyone who has spent time doing prompt-based image generation knows, actual control isn't there yet.


I’ll take the 20%. Text-to-video won’t be good enough to be called realistic in a year.


I admit you gave me pause when you said this. But I'd note my comment was pretty specific:

> It's very likely that realistic looking (to the level of Stable Diffusion) video will happen and tools to create it will be available within 12 months (maybe 80% likelihood).

> What is likely to be missing is the ability to control that video in useful ways directly from prompts.

i.e., I'm not claiming it will be entirely text-based.

As an example of the kind of things we'll see I'd point at https://twitter.com/karenxcheng/status/1564626773001719813

This is using live video as a source, but I think an integrated version of this, combined with some kind of (maybe game-based) interface to script it, is achievable.


> but as anyone who has spent time doing prompt-based image generation knows, actual control isn't there yet

It's not, but there are many efforts to fix this, and they solve most problems, although I admit they're not ideal. One way you can control generation is simply by messing around with per-token weights: for example, if your prompt is being partially ignored, you can use weights to have the guidance put more emphasis on that part of the prompt. Another way is to use img2img and input a simple drawing; this can help it better understand, say, which color you want things to be. The best tool of all is, of course, copious amounts of inpainting and composition. Eyes not the right color? Inpaint them in. Messed-up hand? Inpaint, etc. If your issue is that you can't generate a specific character or style at all, there are tools like textual inversion that can create a special token representing the rough idea, assuming you already have a couple of images that represent that idea.
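As a rough sketch of the per-token weighting idea (illustrative only, not any particular UI's syntax; the function and names are made up), you scale the text-encoder output for the tokens you care about before it's handed to the diffusion model as conditioning:

    import torch

    def weight_tokens(text_embeddings: torch.Tensor,
                      token_weights: dict[int, float]) -> torch.Tensor:
        """Scale individual token embeddings to emphasise or de-emphasise parts of a prompt.

        text_embeddings: (1, seq_len, dim) output of the text encoder for the prompt.
        token_weights:   token position -> weight, e.g. {5: 1.4} boosts the 6th token,
                         {5: 0.6} tones it down.
        """
        weighted = text_embeddings.clone()
        for position, weight in token_weights.items():
            weighted[:, position, :] *= weight
        # Keep the overall magnitude roughly where the model expects it.
        weighted = weighted * (text_embeddings.norm() / weighted.norm())
        return weighted

    # The weighted tensor then replaces the raw text-encoder output as the conditioning.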

There is a real fix though. The big issue with these prompts is that they use the CLIP text encoder, which was trained only on image captions. That means its understanding of the world is limited to whatever is represented by captions found on images on the internet, a very limited subset of language, which limits the quality of the generated embeddings: the model is not only bad at basic language, it doesn't properly understand the relationships between words. Image models that use a large language model (LLM) as the text encoder listen to the prompt much better, since LLMs generate much less noisy embeddings. An LLM's embeddings are rich enough that the diffusion model can actually spell, i.e. if you ask for a sign that says something, the sign will come out with proper spelling and font choice. Sadly there's no such model open to the public yet, but stability.ai is currently training one and we can hopefully expect its release before Christmas.

Having an image model trained with an LLM text encoder, and then applying the tricks used to improve current models, will give you an unprecedented level of control over image generation. The next step is probably an instruction-based model that takes both an image and text as input and runs the instruction over the image. For example, you give it an image of a person and send the instruction "draw this in the style of pixel art" and it'll do it for you. That's already possible with img2img, but a model that listens to instructions can extend this to more useful things like "remove the background", "add another cat", "tilt the sign 90 degrees", "make everything black and white except her dress", etc.


Could you, for example, create a simple 3D render of a very simple environment, with cubes resembling houses and blobs as trees etc., and a camera moving through that environment, and then use video generation on it the way img2img works? That would be great.


Yup, that would be easily doable, the code would essentially be the same.
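To make that concrete, here's roughly what the naive per-frame approach could look like with the diffusers img2img pipeline (an untested sketch: the checkpoint name is just the commonly used SD 1.5 one, the frame filenames are hypothetical, the image argument has been renamed across diffusers versions, and frame-to-frame flicker is the part this doesn't solve):

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a quiet village street at sunset, detailed houses and trees, cinematic lighting"

    for i in range(120):  # crude render frames: frame_000.png, frame_001.png, ...
        render = Image.open(f"frame_{i:03d}.png").convert("RGB").resize((512, 512))
        # Re-seeding each frame reuses the same noise, which reduces (but doesn't remove) flicker.
        generator = torch.Generator(device="cuda").manual_seed(42)
        styled = pipe(
            prompt=prompt,
            image=render,       # called `init_image` in older diffusers releases
            strength=0.55,      # how far to depart from the blocky render
            guidance_scale=7.5,
            generator=generator,
        ).images[0]
        styled.save(f"styled_{i:03d}.png")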


Stable Diffusion video generation is kind of here[1]. I guess the Stability team will partner up with more open-source devs to accelerate this.

[1] https://www.youtube.com/watch?v=yyxgv6MxSDk


This is not really what is being suggested; this repo just navigates the latent space between two prompts and generates an image at specific intervals. It creates cool effects but will never be able to generate a coherent video of, for example, "a man walking on a beach".
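For context, the core of that kind of repo is just interpolating between the embeddings/latents of two prompts and rendering an image at each intermediate point, roughly like this sketch (spherical interpolation is usually preferred over a straight line so the intermediate points stay in a plausible region):

    import torch

    def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
        """Spherical interpolation between two latents/embeddings, t in [0, 1]."""
        a_flat, b_flat = a.flatten(), b.flatten()
        cos = torch.clamp(torch.dot(a_flat / a_flat.norm(), b_flat / b_flat.norm()), -1.0, 1.0)
        omega = torch.arccos(cos)
        if omega.abs() < 1e-4:  # nearly parallel: a plain lerp is fine
            return (1 - t) * a + t * b
        return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

    # "Frames" are the images generated at slerp(emb_a, emb_b, t) for t = 0, 1/N, ..., 1.
    # Nothing ties frame n+1 to frame n temporally, which is why it morphs smoothly
    # but can't produce a coherent "man walking on a beach".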


>> it’s likely a matter of months

> A matter of months eh?

Right. As always people overestimate the effect of a technology in the short run and underestimate the effect in the long run.

It was 15 months between Dall-E and Dall-E 2. And it's already been 5 months since the release of Dall-E 2.

To get homemade Seinfeld bits we'll presumably have to wait for some big research org to release a paper and then it's a matter of months on top of that.

Unless hobbyists can take the current diffusion technology from images and simple videos to complicated videos and video + sound. But I doubt that.


> but I believe that people with your level of expertise

What expertise? It's just a marketing article written by a journalist who writes about all kinds of stuff. It's not written by someone with actual expertise in building models or doing research in the field. I'm surprised this marketing piece is even being upvoted because it's void of anything technical or deep.


This seems a bit pedantic? “A matter of months!? Psh, more like a year”

Either is insanely fast and the distinction is not particularly notable.

Also a matter of months seems an entirely reasonable estimate given the insane velocity around this stuff. I’ve already seen some video generation models.


I'll be pedantic again ... you changed what I wrote from "at least" a year to "more like" a year, which is actually quite different.


So we can already make video. But the "authentic-looking" part is going to take quite a bit longer. Even current face work still has a lot of issues that are a bit more subtle, and the diversity in a whole scene is higher. Then you'd also need convincing voice deep fakes, which do exist. But then you need to integrate those and/or the lip syncing into the prompts (totally doable). Right now people are trying to smooth out the variance between frames.

I'd suspect we'll get something 360p-480p quality in <2 years (definitely less than 5). Idk, is 24 months "a matter of months"?


I’m of similar opinion. I keep seeing claims of smooth AI video at the level of existing image generation ‘coming soon’ but I haven’t seen any evidence that it’s even close. Few years? Sure.


I expect short video generation to be at the level of DALLE 2 in 1 year — e.g. able to get the gist of a prompt, but with lots of artifacts, requiring a lot of compute, and frequently ignoring large parts of the prompt.


I think we can have it very quickly if someone threw a ton of money at it. The question is who wants to take that bet and run at a loss for ages.


Self driving cars followed the same pattern - the demo is decent but the 99.9999% case necessary for real-world usage is perpetually out of reach


What is acceptable for image and content generation has almost no relation to what is acceptable for safety-critical systems.


Given how insanely high the bar is in creative industries, I’m not sure how much success a purely AI-generated graphic can have


I don’t think the bar is uniformly that high. Go look at the covers of some Kindle Unlimited titles – Midjourney art is better


> it’s likely a matter of months before authentic-looking video of newly invented Seinfeld bits like this can be generated by anyone with a few basic, not-especially-technical skills

Let's define "authentic-looking video" as a clip at least 30 seconds long, with good lip-sync, integrating realistic voice, without significant logical incoherence and able to follow prompts that impact both the "scenario" and the "decor" of the video at least reasonably well. That won't happen in a year and there is a good chance it won't happen in 2 years. Production-ready quality for minute-long videos involving an actual plot, even after human post-processing (excluding very significant human involvement) will not be there in the next 5 years.


I never understood the hype around diffusion models. Same thing for deep fakes. If you look closely, these models make horrendous mistakes such as hair growing out of an ear, uneven eyes, etc., which shows that there is no sense of understanding of the underlying concepts in these models. I find GPT-3 a great achievement, but when it comes to visual arts, you can't just call whatever SD/DE2 create "art". At best, these models just reflect the patterns and visual cues they've seen in the training data, but they have no idea as to "what" they are painting. It's actually funny that one of the first applications of these models is in arts, where there's no real-world hard rule to quantify just how terrible/great the results are, because at the end of the day, people will claim that art is all subjective.


> but when it comes to visual arts, you can't just call whatever SD/DE2 create "art"

How do you know? I would bet most people can't tell what's been AI-generated from what hasn't. That's how the AI debacle of The Atlantic got started.

For example, here's what John Naughton had to say about it in The Guardian:

> what really caught my eye was the striking illustration that headed the piece. It showed a cartoonish image of a dishevelled Jones in some kind of cavern surrounded by papers, banknotes, prescriptions and other kinds of documents. Rather good, I thought, and then inspected the caption to see who the artist was. The answer: “AI art by Midjourney”

https://www.theguardian.com/commentisfree/2022/aug/20/ai-art...

John Naughton's bio says he's "professor of the public understanding of technology at the Open University". And he at first couldn't tell the image of Alex Jones illustrating the article was from MidJourney.


>I would bet most people can't tell what's been AI-generated from what hasn't.

This has literally never been the criterion for what art is.


Well if that isn't the criterion that's being asserted, then what is? Art has to be generated by a human? Art has to be "original"? Any definition you come up with is going to be fraught with counter-examples unless you make your definition so restrictive as to eliminate much of what typical people would call art, including many examples of fine art in art museums.


> Art has to be generated by a human?

> eliminate much of what typical people would call art

Can you give me a single example of something produced pre-2021 that the average person would consider art and that was not "generated by a human"?


This was the topic of the parent comment. Also, there has never been AI art before now.


> there has never been AI art before now.

This is a bold claim. I disagree with it.

Here is a book from 2006 that already touches on genetic AI creatures evolving and similar topics as art.

https://mitpress.mit.edu/9780262731768/metacreation/


No, it wasn't the topic of the parent comment, which tilts the definition in this direction:

> they have no idea as to "what" they are painting

You just shifted to "you can't distinguish" - a common tactic for wanna-be AI hypists.


> Also, there has never been AI art before now.

There has however been machine generated art since at least the 50s. How is this not just the natural evolution of that?


I’m the GP, this wasn’t the topic of my comment above.


> If you look closely, these models make horrendous mistakes such as hair growing out of an ear, uneven eyes, etc.

12 months ago I'd have agreed. But these are from raw Stable Diffusion, no face enhancement (unless noted). Just created these right now:

https://imgur.com/a/pGPokDq

https://imgur.com/a/yQ10JzB

https://imgur.com/a/dBtVtg3 (Note this has face enhancement on and I censored this so it is SFW)

https://imgur.com/a/TP24KRY

(Still not great at hands yet!)

The rest of your comment is just your misinterpretation of what the models are doing. Modern diffusion models do have large scale knowledge across the whole scene.


I mean, what I know is that I've got a brand new benchmark for video card upgrades: next time I'm shopping, "VRAM to get 1024x1024 resolution out of Stable Diffusion" is a dominating factor.

All the current work is a shot across the bow: the technology, libraries and research are orbiting some pretty serious utilities.

What does remain to be seen is whether we're in the exponential part of the S-curve or at the top for this particular set of models. But I suspect we'll see the space pretty thoroughly explored in the next 12 months - no large tech company will be taken seriously if it doesn't have a competent image generation model in the wild.


>no large tech company will be taken seriously if it doesn't have a competent image generation model in the wild

Wow, what a lot of bull. Defining a large tech company as 10k+ employees, I can assure you 90% of those won't ever touch image generation. Why would a company like SAP even consider this crap?


If a competitor does it, so will SAP. For instance, if Salesforce can generate meaningful thumbnails for leads or campaigns, etc., then SAP would have to match that.


Assuming this is even a feature users want (they probably don't; why would you have your ERP generate those instead of specialized software?), it makes 0 sense for SAP to build their own model instead of using an open-source one.


Stable Diffusion has been trained at 512x512 and doesn't work very well above this. But upscalers are ok and can even run on CPUs.


They literally lack basic relational understanding so

> knowledge across the whole scene

isn't really correct.

https://arxiv.org/pdf/2208.00005.pdf


Relational understanding is different to whole-scene knowledge.

Relational understanding is about translating the text to the scene.

Whole-scene knowledge is about making sure both eyes are the same color (for example).


I think only two types of ppl care about all that hype right now:

1) nerds

2) professionals that feel they will be negatively impacted by it

general public don't give a fuck


Unless one can give a precise definition of "nerds" that proves they are not, and can never be, part of the "general public", this is simply a circular argument (people who care are nerds, nerds aren't part of the general public, ergo the general public don't care).


there are tons of definitions out there that apply


> So to the extent that numbers somehow “exist” out there in reality apart from what we humans think or say about them, every digital file that you could put on a computer — every iTunes music download, every movie, every picture, every podcast — already “exists” on the regular old integer number line we all know from grade school.

I was once interested in the philosophy and history of numbers. The idea of coming up with counting symbols and concepts is deeper than what we might think. It's interesting to see what others think about the existence of numbers in the "real" world (pun not intended).


The weird part to me is that this makes me imagine a high-dimensional space where we can have some vector that lets you go from some starting position, like a picture of a muffin, and then take that in a "dog-like" direction.

And yet I've seen the weird "dog or muffin" [1] images that make you realize there's not as clean a separation as one might imagine with some pictures.

[1] See, e.g. https://www.freecodecamp.org/news/chihuahua-or-muffin-my-sea...


One of the issues with the current Stable Diffusion img2img implementation is that this is NOT what it does (but it should!).

Instead it takes the image, then tries to denoise it using the latents conditioned on the prompt, and the "strength" setting determines how many denoising loops it does.
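In other words, something like this simplified accounting (illustrative pseudo-code, not the actual library source):

    # img2img: "strength" decides how far into the noise schedule the encoded image starts.
    num_inference_steps = 50
    strength = 0.75            # 0.0 = return the input untouched, 1.0 = ignore it entirely

    steps_to_run = int(num_inference_steps * strength)    # 37 of the 50 denoising steps
    steps_skipped = num_inference_steps - steps_to_run    # the 13 noisiest steps are skipped

    # The VAE-encoded input image is noised up to the level of the first step that runs,
    # then the normal prompt-conditioned denoising loop takes over from there.
    # Higher strength -> start from heavier noise -> the prompt dominates the input image.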


Could it be made to do that?


Definitely! The model accepts CLIP embeds, so you could average the prompt embedding with the image embedding - and then do some sort of top-k selection over a few of these image embeddings to generate "variations" of an image. I think this is how DALLE-2 does it.
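A naive sketch of that kind of embedding mixing (purely illustrative; whether the result can be fed in directly depends on exactly which embedding a given model conditions on):

    import torch
    import torch.nn.functional as F

    def blend_clip_embeddings(text_emb: torch.Tensor,
                              image_emb: torch.Tensor,
                              alpha: float = 0.5) -> torch.Tensor:
        """Mix a CLIP text embedding with a CLIP image embedding, both shaped (1, dim)."""
        mixed = alpha * F.normalize(text_emb, dim=-1) + (1 - alpha) * F.normalize(image_emb, dim=-1)
        return F.normalize(mixed, dim=-1)  # renormalise so it looks like an ordinary CLIP embedding

    # Generating from several such mixes and keeping the best results is one way
    # to get "variations" of an image.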

Also, the model would benefit from being explicitly finetuned for the task of "inpainting". This is where you randomly erase parts of an image and force the model to predict what is missing based on the prompt. This allows one to edit existing images, filling in the gaps with what they need.

This is technically already supported - but uses a similar "hack" as described above by sampling some noise into an existing image and starting a little bit later in the denoising schedule. Since the model wasn't finetuned for this task - a more generic method is used that doesn't work as well.


I've made an attempt at image embedding averaging. That is insufficient in itself but it was a completely naive implementation.

It really needs to do the denoising from one image to another and vice-versa over a few iterations. Or that's what I'm working on next anyway.


How does denoising work, exactly and why does it seem necessary to get good results?


tl;dr it allows the model to focus on different frequencies of pixel-space at different time steps. Time step zero being the low frequency/low-detail features and time step 1000 being a completely “filled out”, detailed image including all the high frequency features.

Doing so in discretely many steps makes the problem tractable and differentiable.

Gross oversimplification, but that’s the gist.
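To make the "discretely many steps" part concrete, here is a minimal DDPM-style sketch (the linear beta schedule is the textbook default, not anything specific to Stable Diffusion):

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)            # noise added at each step
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # fraction of the original signal left at step t

    def noise_image(x0: torch.Tensor, t: int) -> torch.Tensor:
        """Forward process: jump straight from a clean image x0 to its noised version at step t."""
        eps = torch.randn_like(x0)
        return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps

    # Training: pick a random t, noise the image, and train the UNet to predict eps
    # from (x_t, t, prompt). Sampling runs in reverse, from heavy noise down to a clean
    # image, which is why early steps can only pin down coarse, low-frequency structure
    # and later steps fill in fine detail.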


Different frequencies? Does that mean it's doing something like an FFT?


Bit of a delayed response - but no (well, not necessarily, anyways). Operating in the frequency domain is a useful technique in machine learning. With this method however, the reason that different "timesteps" amount to different frequencies is because you start with:

0: full gaussian noise

1: less gaussian noise, more generated pixels

...

998: barely any noise, mostly generated pixels

999: no noise. generation complete.

It's actually the properties of gaussian noise that allow the network to learn the importance of the frequency domain indirectly. When the noise level is high, only the low-frequency content of pixel space survives. When there is less noise, you can similarly expect the pixel space to include both low _and_ high frequency details. By the end of denoising, the full range is meant to be learned (ideally, anyway).

Having said that - they may also use a conversion to fourier space - I'm aware that at least nvidia does something like this in StyleGAN3.


Why does the noise being Gaussian matter? Would it even work with other types of noise like Perlin or white noise?


“Rather, the model’s configuration of numerical weights stores different attributes and features of a training image — edges, shapes, colors, and even abstract concepts related to what’s depicted”

How is this kind of information extracted into the model in the first place? Let’s say I train the model on pictures of dogs. How does the information about the dog’s body parts, their look and proper orientation end up in the model?


> So to the extent that numbers somehow “exist” out there in reality apart from what we humans think or say about them, every digital file that you could put on a computer-

They don't exist; we just map all of them to a set of numbers, or to anything else.



